
Comparing open dataset platforms

Open dataset platforms serve quite different purposes: Kaggle hosts machine learning practice and competition data, Hugging Face hosts the datasets and corpora used to train models, and repositories such as Zenodo are research archives that assign a DOI so the data can be cited in a paper. This guide recommends a platform for each type of data you might be looking for.

The short answer

Choosing a platform by data type is the simplest approach. For machine learning competition and practice data (with a defined task and evaluation), go to Kaggle. For datasets and corpora used to train models (text, image, audio, multimodal), go to Hugging Face. For research archive data you can cite properly in a paper (with a DOI), go to Zenodo, figshare or Harvard Dataverse. The three are often used together rather than being an either-or choice.
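As a rough illustration, the rule of thumb above can be written as a small lookup table. The category keys (`competition`, `training`, `archive`) and the function name are hypothetical labels chosen for this sketch, not terms used by any of the platforms.

```python
# Hypothetical category keys summarising the recommendation above.
PLATFORMS_BY_DATA_TYPE = {
    "competition": ["Kaggle"],
    "training": ["Hugging Face"],
    "archive": ["Zenodo", "figshare", "Harvard Dataverse"],
}


def recommend_platforms(data_type: str) -> list[str]:
    """Return the suggested platforms for a data type (case-insensitive)."""
    key = data_type.strip().lower()
    if key not in PLATFORMS_BY_DATA_TYPE:
        known = ", ".join(sorted(PLATFORMS_BY_DATA_TYPE))
        raise ValueError(f"unknown data type {data_type!r}; expected one of: {known}")
    # Return a copy so callers cannot mutate the shared table.
    return list(PLATFORMS_BY_DATA_TYPE[key])
```

Since the platforms are often used together, a real project may look up more than one category.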

Choosing a platform by data type

For machine learning competitions and practice data: Kaggle

Kaggle is a community platform for AI and data science. It offers hundreds of thousands of open datasets that can be browsed and downloaded for free, alongside competitions, notebooks and courses. Its strength is structured data with a clearly defined task, evaluation criteria and community discussion, which makes it well suited to practice projects, reproducing a baseline or entering a competition.

  • License characteristics: each dataset's license is chosen by its publisher; common ones include CC0 and CC BY-SA. Check the license stated on the dataset page before use.
  • Access from inside China: direct access is unreliable; use a domestic mirror or a comparable platform instead.
  • Citation: cite the dataset's author, name and page link, and follow that dataset's license.
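Because each publisher chooses the license, it can help to record what a license requires before reusing a dataset. The sketch below maps a few Creative Commons licenses, written as SPDX identifiers, to their main obligations. The table is deliberately incomplete and the function name is hypothetical; the license text on the dataset page remains the authority.

```python
# A deliberately small table: SPDX identifier -> main reuse obligations.
LICENSE_OBLIGATIONS = {
    "CC0-1.0": {"attribution": False, "share_alike": False},
    "CC-BY-4.0": {"attribution": True, "share_alike": False},
    "CC-BY-SA-4.0": {"attribution": True, "share_alike": True},
}


def obligations(license_id: str) -> dict:
    """Look up the obligations for a license, refusing to guess for unknown ones."""
    try:
        return dict(LICENSE_OBLIGATIONS[license_id])
    except KeyError:
        raise ValueError(
            f"no entry for {license_id!r}; read the license on the dataset page"
        ) from None
```

Raising an error for an unlisted license, rather than returning a default, keeps an unknown license from being treated as permissive by accident.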

For models, corpora and multimodal data: Hugging Face

Hugging Face is an AI community platform whose datasets section hosts a very large number of datasets spanning text, image, audio, video, tabular, time-series, geospatial and other modalities, with free access. It is closely integrated with models and training workflows, which makes it well suited to finding training corpora, benchmark datasets and data for specific tasks.

  • License characteristics: licenses are stated by the publisher and vary widely; some datasets require accepting the terms of use or requesting access.
  • Access from inside China: direct access is unreliable; use a domestic mirror or a comparable platform instead.
  • Citation: cite the dataset name, the publisher and the link, and use it under the license stated on the dataset card.
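A minimal sketch of previewing a dataset before committing to a full download, assuming the `datasets` library is installed. The helper names are hypothetical, and the repository id passed in is whatever appears on the dataset page. Some datasets also need a configuration name, or require that you accept terms of use first, neither of which this sketch handles.

```python
from itertools import islice


def dataset_page_url(repo_id: str) -> str:
    """Return the dataset card URL, where the license and access terms are stated."""
    return f"https://huggingface.co/datasets/{repo_id}"


def preview_rows(repo_id: str, split: str = "train", n: int = 3) -> list[dict]:
    """Stream the first n rows of a split without downloading the whole dataset."""
    # Imported lazily so dataset_page_url works even without the library installed.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Streaming keeps the preview cheap for large corpora. The split name `train` is only a default and not every dataset has one.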

For DOI-backed research archive data: Zenodo, figshare, Harvard Dataverse

These three are research data repositories. Their shared feature is that they assign a DOI to each dataset, so it can be cited directly in a paper. They are well suited to finding the data accompanying published research, supplementary materials and research datasets meant for long-term preservation.

  • Zenodo: developed under the European OpenAIRE program and operated by CERN, with free upload and access, hosting data, software, papers and conference materials across all disciplines.
  • figshare: an open-access repository that assigns a DOI to each item, with free upload and access; datasets are often published under Creative Commons licenses, and figures, datasets and code can all be hosted.
  • Harvard Dataverse: maintained by institutions including Harvard's Institute for Quantitative Social Science. It is free and open to researchers in all disciplines, assigns a DOI, and is especially rich in social science data. The underlying Dataverse software is open source, and many institutions worldwide run their own instances.
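DOIs turn up in several spellings (bare, with a `doi:` prefix, or as a resolver link), so it can be useful to normalize them before citing. The sketch below is one way to do that. The regular expression is a loose sanity check rather than a full validator, the function names are hypothetical, and the record number in the usage note is a made-up placeholder.

```python
import re

# Loose shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and a resolver or 'doi:' prefix, returning the bare DOI."""
    doi = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"{raw!r} does not look like a DOI")
    return doi


def doi_url(raw: str) -> str:
    """Return the resolver link for a DOI, suitable for a reference list."""
    return "https://doi.org/" + normalize_doi(raw)
```

For example, `doi_url("doi:10.5281/zenodo.1234567")` produces a `https://doi.org/` link for that placeholder record.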

The three types of platform, side by side

| Platform | Best for finding | License characteristics | Citation | Access from inside China |
|---|---|---|---|---|
| Kaggle | Machine learning competition and practice data | Publisher's choice (CC0 / CC BY-SA, etc.) | Author + name + link | Direct access unreliable |
| Hugging Face | Model training datasets and corpora | Stated by publisher, varies widely | Name + publisher + link | Direct access unreliable |
| Zenodo | DOI-backed research archives and software | Open or restricted, mostly open licenses | Cite the DOI | Reachable, speed varies |
| figshare | Paper data, figures and code | Mostly Creative Commons licenses | Cite the DOI | Reachable, speed varies |
| Harvard Dataverse | Social science and multidisciplinary research data | Stated per dataset | Cite the DOI | Reachable, speed varies |
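The citation column can be sketched as a single helper that cites the DOI when one exists and falls back to the page link otherwise. The `DatasetRef` fields and the output layout are assumptions made for illustration. A journal's own reference style takes precedence over this format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRef:
    """Hypothetical record of the fields needed to cite a dataset."""
    name: str
    creator: str
    platform: str
    url: str
    doi: Optional[str] = None
    license_name: Optional[str] = None


def format_citation(ref: DatasetRef) -> str:
    """Build a one-line citation, preferring the DOI link over the page link."""
    locator = f"https://doi.org/{ref.doi}" if ref.doi else ref.url
    fields = [ref.creator, ref.name, ref.platform]
    if ref.license_name:
        fields.append(f"License: {ref.license_name}")
    # The locator goes last so no trailing period is glued onto the link.
    fields.append(locator)
    return ". ".join(fields)
```

A Kaggle or Hugging Face entry would leave `doi` empty and rely on the page link, while an entry from Zenodo, figshare or Harvard Dataverse would carry the DOI.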

How to use them together

  • For a model project: take training data from Hugging Face or Kaggle, then deposit the data and code you have prepared yourself on Zenodo or figshare to obtain a DOI, so that both can be cited in a paper.
  • For an empirical paper: prefer DOI-backed archive data (Zenodo, figshare, Dataverse), which has a standard citation form and is traceable; competition-style data is better suited to demonstrating a method.
  • For a specific topic: search the relevant platform first, then read the plain-language explanations of high-value open datasets in our curated datasets collection, which spares you from working through the English documentation.

What to do when international platforms are hard to reach

Some international platforms are not reliably reachable from inside China. Where a platform has a domestic mirror site, use it; otherwise switch to a comparable domestic public platform or repository. If you need only one specific dataset and would rather not deal with access issues, you can hand the requirement to us. We start with a free availability assessment: we search authoritative data platforms and judge each of your required items against what we find. Even if no fully matching dataset turns up, we still report the search directions, approximate sources and item-by-item judgments honestly for your reference.
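If you want to script the choice between a platform and its mirror, one possible approach is to probe each candidate address in order and use the first that answers. This is a sketch using only the Python standard library. The function names are hypothetical, and any mirror address you pass in is your own choice, not something this guide specifies.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the host answers a HEAD request within the timeout."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status, so it is reachable.
        return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def first_reachable(
    candidates: Iterable[str],
    probe: Callable[[str], bool] = is_reachable,
) -> Optional[str]:
    """Return the first candidate URL the probe accepts, or None if none respond."""
    for url in candidates:
        if probe(url):
            return url
    return None
```

A call such as `first_reachable(["https://huggingface.co", "https://mirror.example.org"])` tries the official site first and the placeholder mirror second. For Hugging Face in particular, the `huggingface_hub` library reads an `HF_ENDPOINT` environment variable, so the chosen base URL can be exported before the library is imported; check the documentation of the version you use.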
