What is the difference between Kaggle, Hugging Face and Zenodo, and how do I choose?

Choose by data type. For machine learning practice or competitions, where you want structured data with a defined task, go to Kaggle. For datasets and corpora used to train models, including text, image, audio and multimodal data, go to Hugging Face. For research archives that carry a DOI and can be cited in a paper, go to Zenodo, figshare or Harvard Dataverse. These three kinds of platform serve different purposes and are often used together.

Does data on these platforms cost money?

Public datasets on Kaggle, Hugging Face, Zenodo, figshare and Harvard Dataverse can be browsed and downloaded for free; some datasets require free registration or accepting the terms of use. The platforms themselves do not charge for public data.

How do I cite datasets from these platforms in a paper?

Zenodo, figshare and Harvard Dataverse assign a DOI to each dataset, which you can cite directly; this is the most standard approach. For Kaggle and Hugging Face datasets, cite the author, the dataset name and the link, and follow the license stated on the dataset page.


Comparing open dataset platforms

These platforms serve quite different purposes: Kaggle is for machine learning practice and competition data, Hugging Face is for the datasets and corpora used to train models, and repositories like Zenodo are research archives that carry a DOI and can be cited in a paper. Below we explain which platform to use, organized by the type of data you are looking for.

The short answer

Choosing a platform by data type is the simplest approach. For machine learning competition and practice data (with a defined task and evaluation), go to Kaggle. For datasets and corpora used to train models (text, image, audio, multimodal), go to Hugging Face. For research archive data you can cite properly in a paper (with a DOI), go to Zenodo, figshare or Harvard Dataverse. These three kinds of platform complement each other; the choice is not either-or.

Choosing a platform by data type

For machine learning competitions and practice data: Kaggle

Kaggle is a community platform for AI and data science. It offers hundreds of thousands of open datasets that can be browsed and downloaded for free, alongside competitions, notebooks and courses. Its strength is structured data with a clearly defined task, evaluation criteria and community discussion, which makes it well suited to practice projects, reproducing a baseline or entering a competition.

License characteristics: each dataset's license is chosen by its publisher; common ones include CC0 and CC BY-SA. Check the license stated on the dataset page before use.
Access from inside China: direct access is not reliable from inside China; a domestic mirror or comparable platform can be used instead.
Citation: cite the dataset's author, name and page link, and follow that dataset's license.

For models, corpora and multimodal data: Hugging Face

Hugging Face is an AI community platform whose datasets section hosts a very large number of datasets spanning text, image, audio, video, tabular, time-series, geospatial and other modalities, with free access. It is closely integrated with models and training workflows, which makes it well suited to finding training corpora, benchmark datasets and data for specific tasks.

License characteristics: licenses are stated by the publisher and vary widely; some datasets require accepting the terms of use or requesting access.
Access from inside China: direct access is not reliable from inside China; a domestic mirror or comparable platform can be used instead.
Citation: cite the dataset name, the publisher and the link, and use it under the license stated on the dataset card.
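
The citation advice for Kaggle and Hugging Face (author or publisher, dataset name, link, plus the stated license) can be turned into a small helper. This is only a sketch: the `DatasetRef` fields, the output wording and the example dataset are illustrative assumptions, not a format prescribed by either platform or by any citation style.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetRef:
    """Details copied by hand from a dataset page (hypothetical structure)."""
    author: str   # author or publisher shown on the page
    name: str     # dataset name
    url: str      # link to the dataset page
    license: str  # license exactly as stated on the page


def format_citation(ref: DatasetRef, accessed: date) -> str:
    """Build a plain-text citation from author, dataset name and page link."""
    return (
        f"{ref.author}. {ref.name} [Data set]. {ref.url} "
        f"(accessed {accessed.isoformat()}; license: {ref.license})."
    )


# Made-up example entry; replace with the values from the real dataset page.
example = DatasetRef(
    author="Example Author",
    name="Example Dataset",
    url="https://www.kaggle.com/datasets/example-author/example-dataset",
    license="CC0",
)
print(format_citation(example, date(2024, 1, 15)))
```

Recording the access date and the license alongside the link is a design choice here: dataset pages without a DOI can change over time, so the extra detail helps a reader trace what was actually used.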

For DOI-backed research archive data: Zenodo, figshare, Harvard Dataverse

These three are research data repositories. Their shared feature is that they assign a DOI to each dataset, so it can be cited directly in a paper. They are well suited to finding the data accompanying published research, supplementary materials and research datasets meant for long-term preservation.

Zenodo: developed under the European OpenAIRE programme and operated by CERN, with free upload and access, hosting data, software, papers and conference materials across all disciplines.
figshare: an open-access repository that assigns a DOI to each item, with free upload and access; datasets are often published under Creative Commons licenses, and figures, datasets and code can all be hosted.
Harvard Dataverse: maintained by institutions including Harvard's Institute for Quantitative Social Science, open and free to researchers across disciplines and assigning a DOI, with especially rich social science data; the underlying Dataverse is open-source software, and many institutions worldwide run their own instances.
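
Because these repositories assign DOIs, a citation can often be retrieved from the DOI itself. The sketch below prepares, but does not send, a request to the doi.org resolver asking for BibTeX through HTTP content negotiation. The DOI shown is a placeholder with a made-up record number, and whether BibTeX is returned depends on the registration agency behind the DOI, so treat this as an assumption to verify rather than a guarantee.

```python
import urllib.request


def doi_url(doi: str) -> str:
    """Turn a bare DOI into its resolver URL."""
    return "https://doi.org/" + doi.strip()


def bibtex_request(doi: str) -> urllib.request.Request:
    """Prepare (but do not send) a request asking the resolver for BibTeX."""
    return urllib.request.Request(
        doi_url(doi),
        headers={"Accept": "application/x-bibtex"},
    )


# Placeholder DOI: 10.5281 is the Zenodo prefix, the record number is made up.
req = bibtex_request("10.5281/zenodo.0000000")
print(req.full_url)
print(req.get_header("Accept"))

# To fetch a real citation (needs network access and a real DOI):
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```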

The three types of platform, side by side

Platform	Best for finding	License characteristics	Citation	Access from inside China
Kaggle	Machine learning competition and practice data	Publisher's choice (CC0 / CC BY-SA, etc.)	Author + name + link	Direct access unreliable
Hugging Face	Model training datasets and corpora	Stated by publisher, varies widely	Name + publisher + link	Direct access unreliable
Zenodo	DOI-backed research archives and software	Open or restricted, mostly open licenses	Cite the DOI	Reachable, speed varies
figshare	Paper data, figures and code	Mostly Creative Commons licenses	Cite the DOI	Reachable, speed varies
Harvard Dataverse	Social science and multidisciplinary research data	Stated per dataset	Cite the DOI	Reachable, speed varies
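
The table can also be read as a simple lookup from the kind of data you need to a platform and its citation style. The sketch below encodes that reading; the category labels and the `recommend` helper are names chosen for illustration, and the mapping only restates the table above.

```python
# Need -> (platform, how to cite), restating the comparison table.
RECOMMENDATIONS = {
    "competition or practice data": ("Kaggle", "author + name + link"),
    "model training data": ("Hugging Face", "name + publisher + link"),
    "citable research archive": ("Zenodo, figshare or Harvard Dataverse", "DOI"),
}


def recommend(need: str) -> str:
    """Return the suggested platform and citation style for a data need."""
    platform, citation = RECOMMENDATIONS[need]
    return f"{platform} (cite by {citation})"


for need in RECOMMENDATIONS:
    print(f"{need}: {recommend(need)}")
```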

How to use them together

For a model project: take training data from Hugging Face or Kaggle, then deposit the data and code you have prepared yourself on Zenodo or figshare to obtain a DOI, which makes it easy to cite in a paper.
For an empirical paper: prefer DOI-backed archive data (Zenodo, figshare, Dataverse), which is standard to cite and traceable; competition-style data is better suited to demonstrating a method.
For a specific topic: search the relevant platform first, then read the plain-language explanations of high-value open datasets in our dataset library, which saves you the time of working through English documentation.
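
For the model-project workflow above, depositing your own data on Zenodo can be scripted through its REST API. The sketch below only builds the request that would create a draft deposition; it does not send anything. It assumes a personal access token (the value shown is a placeholder) and targets the Zenodo sandbox so that experiments do not create real records. Uploading files and publishing, which is when the DOI is registered, are separate steps not shown here, and field names should be checked against the current Zenodo API documentation.

```python
import json
import urllib.request

# Sandbox endpoint for testing; the production host is zenodo.org.
ZENODO_API = "https://sandbox.zenodo.org/api/deposit/depositions"


def deposition_request(title, description, creators, token):
    """Prepare (but do not send) a request that creates a draft deposition."""
    payload = {
        "metadata": {
            "upload_type": "dataset",
            "title": title,
            "description": description,
            "creators": [{"name": name} for name in creators],
        }
    }
    return urllib.request.Request(
        ZENODO_API,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# All values below are made-up examples.
draft = deposition_request(
    title="Example training data",
    description="Prepared data and code for an example project.",
    creators=["Author, Example"],
    token="YOUR-ACCESS-TOKEN",
)
print(draft.get_method(), draft.full_url)
print(json.loads(draft.data)["metadata"]["upload_type"])
```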

What to do when international platforms are hard to reach

Some international platforms are not reliably reachable from inside China. In that case, use the platform's domestic mirror site where one exists, or switch to a comparable domestic public platform or repository. If you need only one specific dataset and would rather not deal with access issues, you can hand the requirement to us: we start with a free availability assessment, run a real search across authoritative data platforms, and check each of your requirements against what we find. Even if no fully matching dataset turns up, we still report the search directions, approximate sources and item-by-item judgments honestly for your reference.
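
Where a mirror site exists, switching to it often amounts to swapping the host name in the dataset link. The sketch below shows that rewrite under two loud assumptions: `mirror.example` is a placeholder host, not a real mirror, and the mirror is assumed to keep the same path layout as the original site, which not every mirror does.

```python
from urllib.parse import urlsplit, urlunsplit


def with_mirror(url: str, mirror_host: str) -> str:
    """Return the same URL with its host replaced by a mirror host."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=mirror_host))


# Made-up dataset link and placeholder mirror host.
original = "https://huggingface.co/datasets/example-org/example-dataset"
print(with_mirror(original, "mirror.example"))
```

Some client libraries expose a setting for this instead of requiring manual rewrites; for example, the `huggingface_hub` Python library reads an `HF_ENDPOINT` environment variable as its base URL. Check the documentation of the tool you use before relying on it.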
