Where to find machine learning training datasets
The simplest approach is to pick a platform by the type of data you need. This guide walks through what each major platform is best for, how to use it, and what to watch out for.
The short answer
Choose by purpose: Kaggle for practice and competitions; Hugging Face for training corpora and model data; UCI and OpenML for classic tabular and benchmark data; Google Dataset Search for cross-platform discovery; Common Crawl (web text) and OPUS (multilingual parallel corpora) for large-scale corpora. Confirm the license, scale and labeling quality before you download.
Choosing a platform by purpose
Practice, competitions and reproducing projects: Kaggle
Kaggle is a community platform for AI and data science, offering hundreds of thousands of open datasets along with competitions, notebooks and courses. Its strength is structured data with clearly defined tasks and evaluation criteria, which makes it well suited to practice, reproducing a baseline or entering a competition. Each dataset's license is set by its publisher, so check it dataset by dataset. Direct access from within China is unreliable. For a deeper comparison of the platforms, see our research services.
Training corpora and model data: Hugging Face
Hugging Face is an AI community platform whose datasets section hosts a vast number of datasets spanning text, images, audio, video, tabular data, time series and more. It is closely tied to models and training workflows, which makes it a good place to look for training corpora and benchmark datasets. Licenses are labeled by the publisher and vary widely; some require agreeing to terms or requesting access. Direct access from within China is unreliable.
Classic tabular and benchmark data: UCI, OpenML
The UCI Machine Learning Repository (archive.ics.uci.edu) was created in 1987, and its current site was redesigned in 2023. It holds several hundred widely cited datasets, mostly tabular classification and regression benchmarks, which makes it useful for teaching, comparing algorithms and reproducing classic experiments. OpenML (openml.org) is an open machine learning platform with a large collection of datasets and standardized tasks that can be downloaded directly through its API, making it convenient for reproduction and side-by-side algorithm comparison. Both are free.
Searching across platforms: Google Dataset Search
Google Dataset Search is Google's dataset search engine. It indexes datasets across the web that carry standard metadata; it does not host the data itself but points you to the original page where the data lives. It is a good first stop when you are not sure which platform to use. It is free. Note that access to Google services from within China is unreliable.
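The "standard metadata" in question is typically schema.org `Dataset` markup embedded in the page as JSON-LD. As a rough illustration, the sketch below parses such a block with Python's standard library and pulls out the fields worth checking before you follow the link. The sample record, its URLs and the helper name `summarize_dataset_markup` are invented for illustration; real pages vary in which fields they fill in.

```python
import json

# Hypothetical JSON-LD block of the kind a dataset page might embed.
# All values here are invented for illustration.
SAMPLE_JSON_LD = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example street-sign images",
  "description": "Hypothetical labeled images of street signs.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://data.example.org/street-signs",
  "distribution": [
    {"@type": "DataDownload", "encodingFormat": "text/csv",
     "contentUrl": "https://data.example.org/street-signs/labels.csv"}
  ]
}
"""


def summarize_dataset_markup(json_ld_text):
    """Return name, license, landing page and formats from Dataset markup.

    Returns None when the markup does not describe a schema.org Dataset.
    """
    record = json.loads(json_ld_text)
    if record.get("@type") != "Dataset":
        return None
    distributions = record.get("distribution", [])
    if isinstance(distributions, dict):  # a single distribution object
        distributions = [distributions]
    return {
        "name": record.get("name"),
        "license": record.get("license"),  # may be missing on real pages
        "landing_page": record.get("url"),
        "formats": [d.get("encodingFormat") for d in distributions],
    }


summary = summarize_dataset_markup(SAMPLE_JSON_LD)
print(summary)
```

A missing `license` field in the markup is itself a signal to read the terms on the original page before downloading.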
Large-scale corpora: Common Crawl, OPUS
Common Crawl (commoncrawl.org) is a large public corpus of crawled web pages. Its scale makes it a common source for pretraining large models, and it is free to obtain. OPUS (opus.nlpl.eu) is an open collection of multilingual parallel corpora that brings together a large amount of translated, side-by-side text, which suits machine translation and multilingual natural language processing. It is also free.
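Corpora at this scale rarely fit in memory, so a common first step is to stream the text and keep a fixed-size random sample for inspection. The sketch below uses reservoir sampling (Algorithm R) over any iterable of lines. The function name `reservoir_sample` and the generator standing in for a corpus file are illustrative assumptions, not part of any Common Crawl or OPUS tooling.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for index, line in enumerate(lines):
        if index < k:
            sample.append(line)  # fill the reservoir first
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = line
    return sample


# Stand-in for a large corpus: a generator, so nothing is held in memory.
corpus = (f"document {i}" for i in range(100_000))
sample = reservoir_sample(corpus, k=5)
print(sample)
```

With a real corpus you would pass an open file handle, or any other line iterator, in place of the generator.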
Platforms side by side
| Platform | Best for finding | Free? | Access from China |
|---|---|---|---|
| Kaggle | Structured data for competitions and practice | Free (some items require registration) | Direct access unreliable |
| Hugging Face | Training corpora and benchmark datasets | Free (some require agreeing to terms) | Direct access unreliable |
| UCI / OpenML | Classic tabular and algorithm benchmarks | Free | Accessible, variable speed |
| Google Dataset Search | Finding the page where data lives, across platforms | Free | Direct access unreliable |
| Common Crawl / OPUS | Large-scale corpora and multilingual parallel text | Free | Accessible, variable speed |
About Papers with Code
Papers with Code used to be a common place to look up SOTA leaderboards and their accompanying datasets. It was shut down in July 2025; the domain now redirects to Hugging Face's papers page, and the original leaderboards are no longer maintained. If you relied on it, use the platforms above instead.
Three things to check before downloading training data
- License and commercial use: confirm the license type first, then check whether commercial use is allowed and whether attribution is required.
- Data scale and format: confirm the size, fields and format suit your model and compute. If a dataset is very large, consider sampling or processing in batches.
- Labeling quality and provenance: confirm the labels are accurate, the coverage is clear and the source is traceable, so dirty labels do not drag down your training.
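The three checks above can be turned into a small pre-download gate. This is a minimal sketch under stated assumptions: the `DatasetCard` fields, the license allow-list and the thresholds are hypothetical placeholders to adapt to your own project, not a standard.

```python
from dataclasses import dataclass

# Example allow-list only; set it according to your own requirements.
COMMERCIAL_OK_LICENSES = {"cc0-1.0", "cc-by-4.0", "mit", "apache-2.0"}


@dataclass
class DatasetCard:
    """Hypothetical summary of what you know about a candidate dataset."""
    name: str
    license_id: str   # e.g. "cc-by-4.0", as declared by the publisher
    size_gb: float    # download size
    source_url: str   # where the data came from, "" if unknown
    labels: list      # a sample of labels; None marks a missing one


def pre_download_issues(card, max_size_gb=50.0, max_missing_ratio=0.05):
    """Return a list of human-readable problems; empty means no red flags."""
    issues = []
    # 1. License: is the declared license on the commercial allow-list?
    if card.license_id.lower() not in COMMERCIAL_OK_LICENSES:
        issues.append(f"license '{card.license_id}' not on the commercial allow-list")
    # 2. Scale: does it fit the compute and storage budget?
    if card.size_gb > max_size_gb:
        issues.append(f"{card.size_gb} GB exceeds budget; consider sampling or batching")
    # 3. Labels and provenance: missing labels and untraceable sources.
    if card.labels:
        missing = sum(label is None for label in card.labels) / len(card.labels)
        if missing > max_missing_ratio:
            issues.append(f"{missing:.0%} of sampled labels are missing")
    else:
        issues.append("no label sample available to inspect")
    if not card.source_url:
        issues.append("source is not traceable")
    return issues


card = DatasetCard("demo", "CC-BY-NC-4.0", 120.0, "", ["cat", None, "dog", None])
for issue in pre_download_issues(card):
    print("-", issue)
```

Returning a list of issues rather than a single pass/fail flag keeps every problem visible, which matters when a dataset fails more than one check.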
Cannot find suitable training data? Leave it to us
If you cannot decide which platform to use, or you have searched and found nothing suitable, give us your task goal and the conditions the data must meet. We first run a free data availability assessment: we actually search reputable data platforms and judge matches and gaps against each requirement. Even when no perfectly fitting dataset turns up, we report the search directions, approximate sources and item-by-item findings honestly for your reference.
