By Bo Yang, Alibaba Cloud Community Blog author.
The goal of this article is to help you find a public dataset that you can use in your machine learning pipeline, whether for a machine learning demo, a proof-of-concept, or a research project. It may not always be possible to collect your own data, but by using public data, you can create machine learning pipelines that are useful for a large number of applications.
Machine learning requires data. Without data you cannot be sure a machine learning model works. However, the data you need may not always be readily available.
Data may not have been collected or labeled yet, or may not be readily available for machine learning model development because of technological, budgetary, privacy, or security concerns. Especially in business contexts, stakeholders want to see how a machine learning system will work before investing the time and money in collecting, labeling, and moving data into such a system. This makes finding substitute data necessary.
This article explains how to find and use public data for machine learning applications such as demos, proofs-of-concept, and research projects. It specifically covers where you can find data for almost any use case, the problems with synthetic data, and the potential issues with using public data. In this article, the term "public data" refers to any data posted openly on the Internet and available for use by anyone who complies with the licensing terms of the data. This definition goes beyond the typical scope of "open data", which usually refers only to government-released data.
One solution to these data needs is to generate synthetic data, or, in layman's terms, fake data. Sometimes this is safe. But synthetic data is usually inappropriate for machine learning use cases because most datasets are too complex to fake correctly. More to the point, using synthetic data can lead to misunderstandings during the development phase about how your machine learning model will perform with the intended data later on.
In a professional context, using synthetic data is especially risky. If a model trained with synthetic data performs worse than a model trained with the intended data, stakeholders may dismiss your work even though the model would, in reality, have met their needs. If a model trained with synthetic data performs better than a model trained with the intended data, you create unrealistic expectations. Generally, you rarely know how the performance of your model will change when it is trained with a different dataset until you train it with that dataset.
Thus, using synthetic data creates a burden to communicate that any discussion of model performance is purely speculative. Model performance on substitute data is speculative as well, of course, but a model trained on a well-chosen substitute dataset will perform more like a model trained on the intended data than a model trained on synthetic data will.
If you feel you understand the intended data well enough to generate an essentially perfect synthetic dataset, then it is pointless to use machine learning, since you can already predict the outcomes. That is, the data you use for training should not be fully predictable in advance: its purpose is to reveal what outcomes are possible, not to confirm what you already know.
Synthetic data is sometimes necessary. It can be useful in the following scenarios:
Be cautious when algorithmically generating data to demonstrate model training on large datasets. Many machine learning algorithms made for training models on large datasets are considerably optimized. These algorithms may detect simplistic (or overly noisy) data and train much faster than on real data.
Methodologies for generating synthetic data at large scale are beyond the scope of this guide. Google has released code to generate large-scale datasets. This code is well-suited for use cases like benchmarking the performance of schema-specific queries or data processing jobs. If the process you are testing is agnostic to the actual data values, it is usually safe to test with synthetic data.
When using public data, you must comply with any licenses and restrictions governing the usage of this data. Just because a dataset is publicly available does not mean you have a right to use it, and similarly, being unable to identify license terms for a dataset does not mean that you have a right to use it. Some public datasets and data repositories have requirements around citing or referencing the data source. Others may restrict the types of uses of that data. It is important to comply with all legal requirements when using data from public sources. There is no substitute for seeking legal counsel before using public data, and nothing in this guide constitutes any kind of legal advice.
Web address: https://toolbox.google.com/datasetsearch License(s): Varies, often displayed with search results.
Google Dataset Search is the recommended starting point for any data search. Many popular public data repositories are indexed, including many of the sources listed in this document and almost all government data.
Dataset Search supports the site: operator, which limits results to a single site. For example: https://toolbox.google.com/datasetsearch/search?query=site%3Adata.gov%20purchase%20orders&docid=KGlM37oh1x%2Bo4yYsAAAAAA%3D%3D
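A site-restricted query like this can also be assembled programmatically. The following is a minimal sketch that only builds the URL string; the base address is the one given in this article, and the helper name `dataset_search_url` is made up for illustration. It makes no network request.

```python
from urllib.parse import quote, urlencode

# Base address as given in this article; it may have changed since publication.
DATASET_SEARCH_BASE = "https://toolbox.google.com/datasetsearch/search"


def dataset_search_url(keywords, site=None):
    """Build a Dataset Search URL, optionally restricted with the site: operator."""
    query = f"site:{site} {keywords}" if site else keywords
    # quote_via=quote encodes spaces as %20 rather than "+".
    return DATASET_SEARCH_BASE + "?" + urlencode({"query": query}, quote_via=quote)


url = dataset_search_url("purchase orders", site="data.gov")
print(url)
```

The same pattern applies to any search engine that accepts the site: operator in its query string.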
Web address: https://www.google.com/ License(s): Varies, often inconclusive.
Google searches with keywords like "data" and "dataset" will sometimes lead you to the data you need. Google Search has additional features that enable powerful dataset searches:
The site: search operator limits searches to a specific site.
Web address: https://www.kaggle.com/datasets License(s): Varies but displayed with results; ability to filter searches by license.
Search 10,000+ datasets uploaded by Kaggle users or maintained by Kaggle. Many of the datasets are well-described, and some datasets also have example Jupyter notebooks ("Kernels"). Kaggle's search interface also has filtering by dataset attributes, including license, and summarizes dataset attributes (also including license) with search results.
Kaggle offers an API for easy downloading of datasets.
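As a rough sketch of what a scripted download can look like, the snippet below wraps the Kaggle command-line client. It assumes the `kaggle` package is installed and an API token is configured; the dataset slug is a hypothetical placeholder, and the helper names are made up for illustration. Nothing is downloaded unless `download_dataset` is called.

```python
import subprocess


def kaggle_download_command(dataset, dest="data"):
    """Build the Kaggle CLI call that downloads and unzips one dataset."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", dest, "--unzip"]


def download_dataset(dataset, dest="data"):
    """Run the download; raises CalledProcessError if the CLI reports a failure."""
    subprocess.run(kaggle_download_command(dataset, dest), check=True)


# Hypothetical slug in "owner/dataset-name" form, shown only to illustrate the call.
cmd = kaggle_download_command("some-owner/some-dataset")
print(" ".join(cmd))
```

Check the license shown on the dataset page before scripting a download, since terms vary by dataset.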
Web address: https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset License(s): Varies, stated with Terms of Service for each dataset.
Over 100 datasets are available to all Google Cloud Platform users as BigQuery tables. The quality of these datasets is very high, and many datasets are updated continuously.
These are some of the highest quality public datasets available. It is worth scanning the catalog to see what's available, since many common needs and use cases are supported by these datasets (weather, genomic, census, blockchain, patents, current events, sports, images, comments/forums, macroeconomic, and clickstream analytics, among other things).
Web address: https://ai.google/tools/datasets/ License(s): Varies.
Google regularly releases AI datasets to the public. Many of the datasets are existing data annotated with labels (either programmatically or manually), and most are unstructured data (text, video, images, sound).
Google Scholar is discussed below, together with academic literature search.
Governments are a popular source of public data, and government data is often shared with generous license terms. But some government data is subject to restrictive licenses. You should not assume that all government datasets are suitable for use.
One unfortunate drawback of government open data sites is that many of the dataset results are redacted, and many "datasets" are hard-to-use file types like PDFs. One easy way to restrict the search on most CKAN sites is to explicitly include an extension in your search (compare to the results without an extension).
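A complementary approach, when a CKAN site exposes its JSON API, is to filter search results by the declared file format after the fact. The sketch below assumes the response shape returned by CKAN's `package_search` action (a `result.results` list whose entries carry a `resources` list with `format` and `url` fields). The sample response is hypothetical and trimmed to those fields, and the helper name is made up for illustration.

```python
def resources_with_format(search_response, wanted_format):
    """Yield (dataset title, resource URL) pairs whose declared format matches."""
    wanted = wanted_format.lower()
    for dataset in search_response["result"]["results"]:
        for resource in dataset.get("resources", []):
            # "format" is free text on many sites, so compare case-insensitively.
            if (resource.get("format") or "").lower() == wanted:
                yield dataset["title"], resource["url"]


# Hypothetical response, trimmed to the fields used above.
sample = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"title": "Purchase Orders", "resources": [
                {"format": "CSV", "url": "https://example.org/purchase-orders.csv"},
                {"format": "PDF", "url": "https://example.org/purchase-orders.pdf"},
            ]},
            {"title": "Annual Report", "resources": [
                {"format": "PDF", "url": "https://example.org/report.pdf"},
            ]},
        ],
    },
}

csv_links = list(resources_with_format(sample, "csv"))
print(csv_links)
```

Format values are entered by publishers and are not always reliable, so a final check of the file extension or content type may still be needed.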
Web address: https://www.data.gov/ License(s): Varies, most search records include details.
Hundreds of thousands of datasets are indexed by the U.S. Federal Government. Most of the data is from the U.S. Federal Government, but datasets from other levels of U.S. government and from non-U.S. governments are also partially indexed. Ad-hoc searches show data.gov does a good job of indexing the listed U.S. state and U.S. municipal open data sites.
Web address: https://www.europeandataportal.eu License(s): Varies, many search records include details.
This site indexes close to 1 million datasets, mainly data produced by the E.U. and E.U. member states (and, for now, the U.K.). 100,000+ datasets are in English, though some non-English datasets do have English search metadata. SPARQL querying is supported for searching linked data.
There are many data repositories hosting open government data. For the very thorough (or when the U.S. Federal Government is shut down), http://datacatalogs.org/search has a list of 500+ government open data repositories, and OpenDataSoft created a list of 2500+ open data repositories. Both of these links offer diminishing returns, and many of these long tail repositories are indexed by Google Dataset Search.
The Freedom of Information Act (FOIA) is a law giving full or partial access to previously unreleased data controlled by the U.S. Federal Government. Similar laws may exist at the U.S. state level, and in other countries. If you think a government agency may have data of interest to you, consider filing a FOIA or similar request. At least one very popular open dataset resulted from a FOIA request.
Making a FOIA or similar request is not always straightforward, though there is help available on the web. In the U.S., most agencies now have specific FOIA procedures that have to be followed (these procedures are usually documented on an agency's website), and the requestor has to pay the cost of obtaining the information, though the cost is usually minimal.
While a FOIA or similar request may seem like a lot to go through for a dataset, the costs are relatively low if the request is well-targeted and turnaround times are not unreasonable (the FOIA specifies 20 days).
Non-government organizations that host their own data.
Web address: https://archive.ics.uci.edu/ml/ License(s): Varies, may not be explicitly stated.
The UC Irvine Machine Learning Repository is one of the oldest and most popular data repositories. It indexes around 500 datasets and is heavily used in CS research--the number of times the UCI repository has been cited in scholarly research would place it in the top 100 most cited CS research papers of all time.
Web address: https://registry.opendata.aws/ License(s): Varies, but usually listed in a dataset's metadata.
Around 100 large public datasets are hosted on AWS, usually in S3 buckets. While the selection is small, the quality of the data is high and most of the datasets are voluminous and unstructured--documents, text, images, etc.
A few of the datasets are also available in Google Cloud Storage buckets.
Web address: https://cran.r-project.org/index.html License(s): Varies, but clearly stated with each R package.
CRAN is the centralized repository of R packages; it is not technically a data repository. But the 10,000+ packages available through CRAN have over 25,000 .rda and .Rdata files, and many are real, high quality (mostly structured) datasets. Further, the contents of those datasets are documented in line with CRAN's very high documentation standards.
Finding Datasets in CRAN. There is no easy way to search only the descriptions of packaged data files in CRAN, though such a search would be possible to build. But there are ways to do less-targeted searches:
Web address: https://www.openml.org/search?type=data License(s): Varies, but clearly stated with most datasets.
OpenML.org is an online platform for sharing machine learning data and experiments. Over 20,000 datasets are available on the platform, usually with indicated licenses, and many of the datasets have been reviewed by an administrator for quality.
Discover high quality datasets thanks to the collection of Huajin Wang, CMU.
Web address: https://msropendata.com/
A collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain specific sciences. Download or copy directly to a cloud-based Data Science Virtual Machine for a seamless development experience.
Web address: https://public.enigma.com/ License(s): CC BY-NC 4.0 (non-commercial)
Enigma is a data broker and services provider that maintains a repository of public data. The sources of the data appear to be FOIA requests, open government data, and scraped websites. There is a large quantity and variety of data, and the interface is well-designed.
Web address: http://academictorrents.com/ License(s): Varies, but usually listed in a dataset's metadata.
Close to 500 datasets are indexed by Academic Torrents. The majority of the datasets are unstructured (video, images, sound, text), and many are exceptionally large, including some terabyte-scale image and video datasets.
Academic Torrents does not host the data itself, instead providing .torrent files and usually a link to a website or paper with more information about the dataset.
Web address: https://www.reddit.com/r/datasets/ License(s): Varies, based on the linked dataset.
Reddit's r/datasets is a community of about 50,000 individuals dedicated to hunting down datasets. Users post requests for datasets and discuss data finding, usage, and purchase.
Reddit's search functionality allows searching only r/datasets, and many kinds of hard-to-find data have been discussed and shared on r/datasets. You can also start a new discussion to get help finding a specific dataset.
Search engines and information repositories not specifically intended for data provide ways to track down data. Using Google Search to find data is discussed above.
Web address: https://github.com/search?type=Code License(s): Varies.
Many developers store data with code, so a lot of data is stored on GitHub. GitHub code search indexes the contents of GitHub repos, and supports searching files by extension (example). The advanced search also allows filtering by license, which parses the LICENSE file found in many repos, though the license search may not be accurate or comprehensive.
Web address: https://archive.org/ License(s): Varies.
The Internet Archive is a "digital library of Internet sites and other cultural artifacts in digital form." It holds a large collection of different types of information, although much of it is disorganized, of unclear origin, and not easily automatically parsed (PDFs, images, audio, and video).
Note that licenses can be difficult to determine. If you cannot find a license, you may still be able to track down the original content owner.
Datasets are vital to research in machine learning and other fields, and close to 1,000 journals require data sharing as a condition for publication. Even when not required, authors often release their data (and code), and the tradition of information sharing in academia means that asking for data used in a publication often results in a response.
Web address: https://scholar.google.com/ License(s): N/A
Google Scholar is the most comprehensive index of academic literature on the web (here's a short overview of how to use Google Scholar). Simple searches can reveal many papers on a specific use case, and searches targeting specific types of data can be productive as well.
Web address: https://arxiv.org/ License(s): N/A
Arxiv is an archive of academic research (technically preprints) with a focus on the sciences. Because Arxiv is especially strong in computer science and machine learning, searching for types of data and specific machine learning problems can give very relevant results.
Once you've identified an academic publication that may be relevant, you have to track down the dataset.
Once you find where the dataset is described, look for citations, web links, or even a dataset name (common when using extremely popular datasets). But sometimes few details are shared (see "Contact the Authors" below).
Many journal articles now have code and data online, and Google Search and GitHub code search will sometimes give results. Searching for the name of the paper may also lead to someone who has implemented the paper's method independently, who may have their own dataset.
If authors cannot share data, they may be able to put you in touch with whoever provided them their data. Additionally, authors tend to know the problems they publish on well, and may be able to point you to similar datasets.
If authors do offer to share the data with you, make sure you understand the terms/license for using the data before using it. You may need to ask authors directly for this. Be aware it is very common for datasets used in academic research to forbid commercial use.
Sources for specific types of data, sometimes unlabeled.
Many websites list links to popular and useful datasets and data repositories. These ones are high quality, with relatively few broken links and a good variety of datasets and repositories.
Unlabeled data is easier to obtain than labeled data, and labeling data is easy and relatively straightforward. You may not need many labeled examples to make a useful dataset, making self-labeling a reasonable choice when you need a dataset.
Many organizations and teams discount the idea of creating their own labels as being too time consuming, too expensive, or even too boring. But for a demo, solution, proof-of-concept, or research project, modest amounts of labeled data may be all that is necessary. If you do not require a large amount of labeled data, and/or your labeling task is unusual, self-labeling may save you time and money versus a labeling service.
A key to successful do-it-yourself labeling is to not overcomplicate the labeling task. Many labeling tasks can be done with just some data munging and a spreadsheet. Focus on getting labels rather than task piloting and design, tool standup, and reliability measurement.
If you need a labeling tool, try to avoid building your own and consider some free and paid alternatives.
Labeling services are a popular way of getting labeled data. These services recruit and manage workers to do your labeling tasks.
Ideally, a labeling task labels neither too little nor too much data. Learning curves help determine if additional labels will improve performance of a model. Learning curves also provide reasonable estimates of the performance ceiling of a specific model+dataset.
The general idea behind learning curves is to train your model with progressively larger training sets and plot the performance metrics. This blog post goes into greater detail.
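The procedure can be sketched with scikit-learn's `learning_curve` helper. This is a minimal illustration, not a recommended setup: the digits dataset bundled with scikit-learn stands in for your labeled data, and the logistic regression pipeline, the five training-set sizes, and the 5-fold cross-validation are all assumptions chosen for brevity.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset bundled with scikit-learn; replace with your labeled data.
X, y = load_digits(return_X_y=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train on 10%, 32.5%, ..., 100% of each training fold and score on the held-out fold.
train_sizes, train_scores, val_scores = learning_curve(
    model,
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{int(n):5d} training examples -> mean validation accuracy {score:.3f}")
```

If the validation score is still climbing at the largest training size, more labels are likely to help; if it has flattened, additional labeling probably will not change much for this model.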
If you have some labeled data, you can create more labeled data using machine learning. Be aware that using machine learning to label unlabeled data has a risk: any information or patterns not present in your original labeled data but present in your unlabeled data will go unnoticed. But if you need more labels to show model training on large amounts of data, or if you want to train a model that is particularly data hungry, automatic label generation approaches may be suitable.
If you need additional labeled data to show model training at large scale or to demonstrate a modeling technique that requires a lot of labeled data, consider training a model on the data you have and labeling your unlabeled data with predictions from the model. If the modeling technique you intend to use is discriminative, you can use a similar generative modeling technique to label data and improve the performance of the discriminative model.
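A minimal sketch of the first approach, often called pseudo-labeling, follows. It uses the digits dataset bundled with scikit-learn as a stand-in and pretends that only 200 examples are labeled. The 0.95 confidence cutoff, the choice of logistic regression, and the use of the same model type for labeling and for the final fit are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_model():
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
# Pretend only 200 examples are labeled; the rest of the pool is "unlabeled".
X_lab, X_unlab, y_lab, _ = train_test_split(
    X_pool, y_pool, train_size=200, random_state=0, stratify=y_pool)

# Step 1: train on the labeled data you have.
base = make_model().fit(X_lab, y_lab)

# Step 2: label the unlabeled data, keeping only confident predictions.
proba = base.predict_proba(X_unlab)
confident = proba.max(axis=1) >= 0.95
pseudo_labels = base.predict(X_unlab)

# Step 3: retrain on the original labels plus the confident pseudo-labels.
X_aug = np.vstack([X_lab, X_unlab[confident]])
y_aug = np.concatenate([y_lab, pseudo_labels[confident]])
augmented = make_model().fit(X_aug, y_aug)

acc_base = base.score(X_test, y_test)
acc_aug = augmented.score(X_test, y_test)
print(f"labeled only: {acc_base:.3f}, with pseudo-labels: {acc_aug:.3f}")
```

Pseudo-labels inherit the base model's mistakes, so the held-out test set should contain only human-verified labels.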
Machine learning methods specifically for labeling unlabeled data are an active area of research. Good starting points are scikit-learn's implementations of label propagation and label spreading.
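As a starting point, the sketch below runs scikit-learn's `LabelSpreading` on the bundled digits dataset with roughly 90% of the labels hidden. The kNN kernel, the neighbor count, and the fraction of hidden labels are assumptions chosen for illustration; scikit-learn marks unlabeled examples with the label -1.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)

# Hide roughly 90% of the labels; -1 is scikit-learn's marker for "unlabeled".
rng = np.random.RandomState(0)
unlabeled = rng.rand(len(y)) < 0.9
y_partial = np.copy(y)
y_partial[unlabeled] = -1

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the label assigned to every training point.
inferred = model.transduction_[unlabeled]
agreement = float((inferred == y[unlabeled]).mean())
print(f"agreement with the hidden true labels: {agreement:.3f}")
```

On real unlabeled data there are no hidden true labels to compare against, so spot-checking a sample of the inferred labels by hand is a reasonable substitute.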
Rarely will you find the exact dataset that you want, but you can almost always find something close to it. Often close is good enough. If the license of a dataset allows it, consider modifying the dataset into the data you want.
A good dataset for modification has similar structure to your ideal dataset and is governed by the same statistical processes. It may not be from the same domain, or have similar labels or field names; these are what you can change.
Also consider using a substitute dataset instead of modifying a dataset. Sometimes the goals of your machine learning work (especially for demos and proofs-of-concept) can be met by a less-than-perfect dataset that still extends to the problem you are discussing.
Your customer is a publisher of religious books, and you are demonstrating a model that predicts weekly sales of their books per store. This dataset of liquor sales in Iowa is probably suitable, but you may want to change the names of some columns, remove other columns, and change some of the categorical labels to be related to books rather than liquor.
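A rough sketch of that kind of modification, using only the Python standard library, is shown below. The sample rows, the column names, and the liquor-to-book category mapping are all hypothetical and may not match the real Iowa file.

```python
import csv
import io

# Hypothetical sample rows; the column names are illustrative only.
RAW_CSV = """date,store_number,city,category_name,bottles_sold,sale_dollars
2019-01-07,2633,DES MOINES,VODKA,12,162.00
2019-01-07,2633,DES MOINES,WHISKEY,6,141.30
2019-01-14,4829,AMES,RUM,3,38.85
"""

RENAME = {"bottles_sold": "copies_sold", "category_name": "genre"}
DROP = {"city"}
# Made-up mapping from liquor categories to book genres.
GENRES = {"VODKA": "DEVOTIONALS", "WHISKEY": "STUDY BIBLES", "RUM": "HYMNALS"}


def adapt_row(row):
    """Drop unwanted columns, recode categories, and rename the remaining columns."""
    out = {}
    for column, value in row.items():
        if column in DROP:
            continue
        if column == "category_name":
            value = GENRES.get(value, "OTHER")
        out[RENAME.get(column, column)] = value
    return out


book_rows = [adapt_row(row) for row in csv.DictReader(io.StringIO(RAW_CSV))]
print(book_rows[0])
```

The numeric columns are left untouched on purpose: the sales patterns are the part of the dataset you want to keep.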
For a research project, you want to demonstrate a machine learning model that counts the number of people in a park. The Stanford Drone Dataset has drone footage of various scenes with labeled bounding boxes around pedestrians. You can use this dataset to create a derivative dataset of still images with counts of pedestrians.
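Turning bounding-box annotations into per-frame counts can be sketched as follows. The column layout used here (track ID, four box coordinates, frame number, lost flag, occluded flag, generated flag, quoted label) is an assumption about the annotation files and should be checked against the dataset's own documentation. The sample lines are made up.

```python
from collections import Counter


def pedestrians_per_frame(lines):
    """Count visible pedestrians in each frame of a whitespace-separated annotation file."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        # Assumed layout: track_id xmin ymin xmax ymax frame lost occluded generated label
        frame = int(fields[5])
        lost = fields[6] == "1"
        label = fields[9].strip('"')
        if label == "Pedestrian" and not lost:
            counts[frame] += 1
    return counts


# Made-up sample annotations for two frames.
sample_lines = [
    '0 100 200 130 260 0 0 0 0 "Pedestrian"',
    '1 300 220 330 280 0 0 0 0 "Pedestrian"',
    '2 50 60 120 110 0 0 0 0 "Biker"',
    '0 102 203 132 263 1 0 0 1 "Pedestrian"',
    '1 305 224 335 284 1 1 0 1 "Pedestrian"',
]

counts = pedestrians_per_frame(sample_lines)
print(dict(counts))
```

Each extracted still image can then be paired with the count for its frame number to form the derivative dataset.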
This dataset of pairs of sentences with a 1-5 label of how "semantically related" (same meaning) the two sentences are could be used for other purposes. For example, a modeling pipeline that performs well on this dataset could be the core component of a system that detects patents with a large number of claims that are similar to claims in other patents.
You want to demonstrate a system that can read an email and predict what product(s) are mentioned in the email. This Stack Overflow dataset, specifically the data of questions with topic/programming language tags, could let you demonstrate that predicting meaningful labels from a block of text is possible.
Below are some resources that allow you to search for massive amounts of data quickly:
Less conventional but nonetheless powerful ways of finding data: