Comparing open dataset platforms
Open dataset platforms fall into three types with quite different purposes: Kaggle hosts machine learning practice and competition data, Hugging Face hosts the datasets and corpora used to train models, and repositories such as Zenodo are research archives whose datasets carry a DOI and can be cited in a paper. The sections below say which platform to use, organized by the type of data you are looking for.
The short answer
Choosing a platform by data type is the simplest approach. For machine learning competition and practice data (with a defined task and evaluation), go to Kaggle. For datasets and corpora used to train models (text, image, audio, multimodal), go to Hugging Face. For research archive data you can cite properly in a paper (with a DOI), go to Zenodo, figshare or Harvard Dataverse. The three types of platform are often used together; this is not an either-or choice.
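
The rule of thumb above can be written as a small lookup. This is only an illustrative sketch: the category keys (`competition`, `training`, `citable`) and the function name `suggest_platforms` are made up for this example and are not part of any platform's API.

```python
# Illustrative mapping from the kind of data you need to the platforms
# suggested in this article. The category keys are hypothetical labels.
PLATFORMS_BY_DATA_TYPE = {
    "competition": ["Kaggle"],
    "training": ["Hugging Face"],
    "citable": ["Zenodo", "figshare", "Harvard Dataverse"],
}


def suggest_platforms(*data_types: str) -> list[str]:
    """Return the platforms to check for the given data types, without duplicates."""
    suggestions: list[str] = []
    for data_type in data_types:
        # Unknown data types simply contribute no suggestions.
        for platform in PLATFORMS_BY_DATA_TYPE.get(data_type, []):
            if platform not in suggestions:
                suggestions.append(platform)
    return suggestions


# A model project that also needs citable data touches more than one platform.
print(suggest_platforms("training", "citable"))
```

Passing several data types at once reflects the point that the platforms complement each other rather than compete.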
Choosing a platform by data type
For machine learning competitions and practice data: Kaggle
Kaggle is a community platform for AI and data science. It offers hundreds of thousands of open datasets that can be browsed and downloaded for free, alongside competitions, notebooks and courses. Its strength is structured data with a clearly defined task, evaluation criteria and community discussion, which makes it well suited to practice projects, reproducing a baseline or entering a competition.
- License characteristics: each dataset's license is chosen by its publisher; common ones include CC0 and CC BY-SA. Check the license stated on the dataset page before use.
- Access from inside China: direct access is not reliable from inside China; a domestic mirror or comparable platform can be used instead.
- Citation: cite the dataset's author, name and page link, and follow that dataset's license.
For models, corpora and multimodal data: Hugging Face
Hugging Face is an AI community platform whose datasets section hosts a very large number of datasets spanning text, image, audio, video, tabular, time-series, geospatial and other modalities, with free access. It is closely integrated with models and training workflows, which makes it well suited to finding training corpora, benchmark datasets and data for specific tasks.
- License characteristics: licenses are stated by the publisher and vary widely; some datasets require accepting the terms of use or requesting access.
- Access from inside China: direct access is not reliable from inside China; a domestic mirror or comparable platform can be used instead.
- Citation: cite the dataset name, the publisher and the link, and use it under the license stated on the dataset card.
For DOI-backed research archive data: Zenodo, figshare, Harvard Dataverse
These three are research data repositories. Their shared feature is that they assign a DOI to each dataset, so it can be cited directly in a paper. They are well suited to finding the data accompanying published research, supplementary materials and research datasets meant for long-term preservation.
- Zenodo: built and operated jointly by CERN and OpenAIRE, with free upload and access, hosting data, software, papers and conference materials across all disciplines.
- figshare: an open-access repository that assigns a DOI to each item, with free upload and access; datasets are often published under Creative Commons licenses, and figures, datasets and code can all be hosted.
- Harvard Dataverse: maintained by institutions including Harvard's Institute for Quantitative Social Science. It is free and open to researchers in all disciplines, assigns a DOI to each dataset, and is especially rich in social science data. The underlying Dataverse software is open source, and many institutions worldwide run their own instances.
The three types of platform, side by side
| Platform | Best for finding | License characteristics | Citation | Access from inside China |
|---|---|---|---|---|
| Kaggle | Machine learning competition and practice data | Publisher's choice (CC0 / CC BY-SA, etc.) | Author + name + link | Direct access unreliable |
| Hugging Face | Model training datasets and corpora | Stated by publisher, varies widely | Name + publisher + link | Direct access unreliable |
| Zenodo | DOI-backed research archives and software | Open or restricted access; mostly open licenses | Cite the DOI | Reachable, speed varies |
| figshare | Paper data, figures and code | Mostly Creative Commons licenses | Cite the DOI | Reachable, speed varies |
| Harvard Dataverse | Social science and multidisciplinary research data | Stated per dataset | Cite the DOI | Reachable, speed varies |
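
The Citation column can be turned into a small helper. The sketch below assumes a simple free-form citation style, not any journal's official format; `DatasetRef`, `format_citation` and the example DOI are hypothetical. The only external fact it relies on is that a DOI resolves through `https://doi.org/`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRef:
    """Minimal metadata needed to cite a dataset (hypothetical structure)."""
    name: str
    publisher: str              # author or publishing organisation
    url: Optional[str] = None   # dataset page link (Kaggle, Hugging Face)
    doi: Optional[str] = None   # DOI from Zenodo, figshare or a Dataverse


def format_citation(ref: DatasetRef) -> str:
    """Prefer the DOI when there is one; otherwise fall back to the page link."""
    if ref.doi:
        locator = f"https://doi.org/{ref.doi}"
    elif ref.url:
        locator = ref.url
    else:
        raise ValueError("a dataset citation needs a DOI or a page link")
    return f"{ref.publisher}. {ref.name}. {locator}"


# Placeholder DOI for illustration only; it does not point to a real record.
example = DatasetRef(name="Example survey data", publisher="A. Researcher",
                     doi="10.5281/zenodo.0000000")
print(format_citation(example))
```

For Kaggle or Hugging Face data, pass `url` instead of `doi`. The dataset's license still has to be checked separately on its page.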
How to use them together
- For a model project: take training data from Hugging Face or Kaggle, then deposit the data and code you have prepared yourself on Zenodo or figshare to obtain a DOI, which makes it easy to cite in a paper.
- For an empirical paper: prefer DOI-backed archive data (Zenodo, figshare, Dataverse), which is standard to cite and traceable; competition-style data is better suited to demonstrating a method.
- For a specific topic: search the relevant platform first, then read the plain-language explanations of high-value open datasets in our curated dataset collection, which saves you the time of working through English documentation.
What to do when international platforms are hard to reach
Some international platforms are not reliably reachable from inside China. In that case, you can use a platform's domestic mirror site where one exists, or switch to a comparable domestic public platform or repository. If you only need one specific dataset and would rather not deal with access issues, you can hand the requirement to us: we start with a free availability assessment, run a real search across authoritative data platforms, and judge each of your required items against what is found, item by item. Even if no fully matching dataset is found, the search directions, approximate sources and item-by-item judgments are still presented honestly for your reference.
