
Where to find machine learning training datasets

The simplest approach is to pick a platform by the type of data you are after: Kaggle for practice and competitions, Hugging Face for training corpora, UCI and OpenML for classic tabular and benchmark data, Google Dataset Search to look across platforms, and Common Crawl and OPUS for large-scale corpora. Below we walk through what each one is best for, how to use it, and what to watch out for.

The short answer

Choose by purpose: Kaggle for practice and competitions; Hugging Face for training corpora and model data; UCI and OpenML for classic tabular and benchmark data; Google Dataset Search for cross-platform discovery; Common Crawl (web text) and OPUS (multilingual parallel corpora) for large-scale corpora. Whichever you pick, confirm the license, scale and labeling quality before you download.

Choosing a platform by purpose

Practice, competitions and reproducing projects: Kaggle

Kaggle is a community platform for AI and data science, offering hundreds of thousands of open datasets along with competitions, notebooks and courses. Its strength is structured data with clearly defined tasks and evaluation criteria, which makes it well suited to practice, reproducing a baseline or entering a competition. Licenses are chosen by whoever publishes the dataset. Note that direct access from within China is unreliable. For a deeper comparison of the platforms, see our research services.

Training corpora and model data: Hugging Face

Hugging Face is an AI community platform whose datasets section hosts a vast number of datasets spanning text, images, audio, video, tabular data, time series and more. It is closely tied to models and training workflows, which makes it a good place to look for training corpora and benchmark datasets. Licenses are labeled by the publisher and vary widely; some require agreeing to terms or requesting access. Direct access from within China is unreliable.
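As a rough illustration of how this looks in practice, the sketch below peeks at the first few records of a Hub dataset without downloading all of it. It assumes the open-source `datasets` library is installed and uses its `load_dataset` function with `streaming=True`. The helper names `peek` and `stream_from_hub` and the repository id in the usage comment are made up for this example, not taken from any particular project. For datasets that require agreeing to terms or requesting access, expect the call to fail until you have accepted the terms on the dataset page and authenticated.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records, reading no more of the stream than needed."""
    return list(islice(records, n))


def stream_from_hub(repo_id: str, split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open a Hub dataset lazily (hypothetical helper).

    Needs `pip install datasets` and network access. The import lives inside
    the function so that `peek` can be used and tested offline.
    """
    from datasets import load_dataset

    # streaming=True iterates over the data instead of downloading it all first
    return load_dataset(repo_id, split=split, streaming=True)


# Usage (placeholder repository id; substitute a real one):
#   for row in peek(stream_from_hub("some-org/some-dataset")):
#       print(row)
```

Looking at a handful of rows first is a cheap way to check fields and format before committing bandwidth and disk space to a full download.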

Classic tabular and benchmark data: UCI, OpenML

The UCI Machine Learning Repository (archive.ics.uci.edu) was created in 1987, and its current site was redesigned in 2023. It holds several hundred widely cited datasets, mostly tabular classification and regression benchmarks, which makes it useful for teaching, comparing algorithms and reproducing classic experiments. OpenML (openml.org) is an open machine learning platform with a large collection of datasets and standardized tasks that can be downloaded directly through its interface, making it convenient for reproduction and side-by-side algorithm comparison. Both are free.

Searching across platforms: Google Dataset Search

Google Dataset Search is Google's dataset search engine. It indexes datasets across the web that carry standard metadata; it does not host the data itself but points you to the original page where the data lives. It is a good first stop when you are not sure which platform to use. It is free. Note that access to Google services from within China is unreliable.

Large-scale corpora: Common Crawl, OPUS

Common Crawl (commoncrawl.org) is a very large public corpus of crawled web pages, commonly used for pretraining large models. OPUS (opus.nlpl.eu) is an open collection of multilingual parallel corpora that brings together a large amount of translated, side-by-side text, which suits machine translation and multilingual natural language processing. Both are free.

Platforms side by side

Platform | Best for finding | Free? | Access from China
Kaggle | Structured data for competitions and practice | Free (some items require registration) | Direct access unreliable
Hugging Face | Training corpora and benchmark datasets | Free (some require agreeing to terms) | Direct access unreliable
UCI / OpenML | Classic tabular and algorithm benchmarks | Free | Accessible, variable speed
Google Dataset Search | Finding the page where data lives, across platforms | Free | Direct access unreliable
Common Crawl / OPUS | Large-scale corpora and multilingual parallel text | Free | Accessible, variable speed

About Papers with Code

Papers with Code used to be a common place to look up SOTA leaderboards and their accompanying datasets. It was shut down in July 2025; the domain now redirects to Hugging Face's papers page, and the original leaderboards are no longer maintained. If you are still looking for it, switch to the platforms above that are still running.

Three things to check before downloading training data

  • Whether the license allows commercial use: confirm the license type first, and check whether commercial use is allowed and whether attribution is required.
  • Data scale and format: confirm the size, fields and format suit your model and compute. If a dataset is very large, consider sampling or processing in batches.
  • Labeling quality and provenance: confirm the labels are accurate, the coverage is clear and the source is traceable, so dirty labels do not drag down your training.
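For the second check, sampling and batching can both be done in a single pass without loading the whole dataset into memory. The sketch below is one way to do it with the Python standard library only. The function names `reservoir_sample` and `batches` are illustrative, and the sampling uses the classic reservoir technique (Algorithm R), which is one reasonable choice rather than the only one.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Uniformly sample up to k rows in one pass, keeping at most k in memory."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def batches(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Both functions accept any iterable, so they work on an open file handle (one line per record) as well as on a streamed dataset. A small fixed seed makes the sample reproducible when you want to compare runs.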

Cannot find suitable training data? Leave it to us

If you cannot decide which platform to use, or you have searched and found nothing suitable, send us your task goal and the conditions the data must meet. We start with a free data availability assessment: we actually search reputable data platforms and record, for each required item, what matches and where the gaps are. Even when no perfectly fitting dataset turns up, we report the search directions, approximate sources and item-by-item findings honestly for your reference.

See research services →
