
Finding Public Data for Your Machine Learning Pipelines

This article discusses how and where you can find public data to use in machine learning pipelines that you can then use in a variety of applications.

By Bo Yang, Alibaba Cloud Community Blog author.

The goal of the article is to help you find a dataset from public data that you can use for your machine learning pipeline, whether it be for a machine learning demo, proof-of-concept, or research project. It may not always be possible to collect your own data, but by using public data, you can create machine learning pipelines that can be useful for a large number of applications.

Machine Learning Requires Data

Machine learning requires data. Without data, you cannot verify that a machine learning model works. However, the data you need may not always be readily available.

Data may not have been collected or labeled yet, or may not be available for machine learning model development because of technological, budgetary, privacy, or security concerns. Especially in a business context, stakeholders want to see how a machine learning system will work before investing the time and money in collecting, labeling, and moving data into such a system. This makes finding substitute data necessary.

This article aims to shed some light on how to find and use public data for various machine learning applications such as machine learning demos, proofs-of-concept, or research projects. It looks specifically at where you can find data for almost any use case, the problems with synthetic data, and the potential issues with using public data. In this article, the term "public data" refers to any data posted openly on the Internet and available for use by anyone who complies with the licensing terms of the data. This definition goes beyond the typical scope of "open data", which usually refers only to government-released data.

Be Careful When It Comes to Synthetic Data

One solution to these data needs is to generate synthetic data, or fake data in layman's terms. Sometimes this is safe, but synthetic data is usually inappropriate for machine learning use cases because most datasets are too complex to fake correctly. Worse, using synthetic data during development can lead to misunderstandings about how your machine learning model will perform on the intended data later on.

In a professional context, using synthetic data is especially risky. If a model trained with synthetic data has worse performance than a model trained with the intended data, stakeholders may dismiss your work even though, in reality, the model would have met their needs. If a model trained with synthetic data performs better than a model trained with the intended data, you create unrealistic expectations. In general, you rarely know how your model's performance will change with a different dataset until you actually train on that dataset.

Thus, using synthetic data creates a burden to communicate that any discussion of model performance is purely speculative. Performance on substitute data is speculative as well, of course, but a model trained on a well-chosen substitute dataset will come closer to the performance of a model trained on the intended data than a model trained on synthetic data will.

If you feel you understand the intended data well enough to generate an essentially perfect synthetic dataset, there is little point in using machine learning, since you can already predict the outcomes. Training data should carry genuine, unknown variation for the model to learn from, not simply confirm what you already clearly know.

When to Use Synthetic Data

It is sometimes necessary to use synthetic data. It can be useful to use synthetic data in the following scenarios:

  • You need to demonstrate something at terabyte scale (there is a shortage of terabyte-scale public datasets).
  • You created a machine learning model or pipeline with internal data, are open sourcing the code, but cannot release the internal data.
  • You are building a reinforcement learning model, and require a simulated environment to train the model.
  • You are conducting fundamental machine learning research, and require data that is clearly understood.

Generating Large-Scale Datasets

Be cautious when algorithmically generating data to demonstrate model training on large datasets. Many machine learning algorithms designed for training on large datasets are heavily optimized, and they may exploit simplistic (or overly noisy) data and train much faster than they would on real data.

Methodologies for generating synthetic data at large scale are beyond the scope of this guide. Google has released code to generate large-scale datasets. This code is well-suited for use cases like benchmarking the performance of schema-specific queries or data processing jobs. If the process you are testing is agnostic to the actual data values, it is usually safe to test with synthetic data.

Tools for Generating Synthetic Machine Learning Data

  • sklearn.datasets: The popular scikit-learn Python package contains methods for generating data with specific properties. Basic generators appropriate for regression, classification, and clustering problems are available as the make_* functions in the sklearn.datasets module (see the sketch after this list).
  • sympy: a Python package for symbolic mathematics. This blog post details how to use sympy to generate data that solves a specific symbolic expression (like y=x), with an option to introduce randomness/noise.
  • pydbgen: a Python package for generating synthetic structured database tables, including specific data fields like emails, phone numbers, names, or cities.
  • trumania: a Python package for scenario-based data generation. trumania uses probabilistic generation schemes where events have causal relationships, and writes the events in a predefined schema.
  • Mockaroo: Mockaroo is a website/API to generate structured data with realistic cell values (e.g., names, zip codes, dates), with the option to transform the generated values with custom functions. The free version generates 1000 rows at a time, and exports the rows in common tabular formats.
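As a concrete illustration of the scikit-learn generators mentioned in the first item above, here is a minimal sketch; the parameter values are purely illustrative, not recommendations.

    # Minimal sketch: generating small synthetic datasets with scikit-learn's
    # make_* helpers. Parameter values are illustrative only.
    from sklearn.datasets import make_classification, make_regression

    # A two-class classification problem with 20 features and 5% label noise
    X_cls, y_cls = make_classification(
        n_samples=1000, n_features=20, n_informative=5, n_classes=2,
        flip_y=0.05, random_state=42)

    # A regression problem with Gaussian noise added to the target
    X_reg, y_reg = make_regression(
        n_samples=1000, n_features=10, noise=10.0, random_state=42)

    print(X_cls.shape, y_cls.shape)  # (1000, 20) (1000,)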

Legal Considerations When Using Public Data

When using public data you must comply with any licenses and restrictions governing the usage of that data. Just because a dataset is publicly available does not mean you have a right to use it, and similarly, being unable to identify license terms for a dataset does not mean you have a right to use it. Some public datasets and data repositories have requirements around citing or referencing the data source. Others may restrict the types of uses of that data. It is important to comply with all legal requirements when using data from public sources. There is no substitute for seeking legal counsel before using public data, and nothing in this guide constitutes legal advice.

Finding Public Datasets

General Tips

  • Be flexible. It is unlikely you will find the exact data you are looking for. Consider expanding your search to similar datasets that may be easier to find, but that will still let you demonstrate the necessary functionality or will be easy to repurpose. For example, you probably cannot find daily retail sales for a large variety of products across the entire world, but maybe a dataset of weekly gasoline prices in different locations in the NYC metro can be adapted appropriately.
  • Do not assume data repositories index everything they purport to. Many data repositories are aggregations of other repositories, but often the indexing of the other repositories is outdated or incomplete. It is worthwhile to re-run your searches on repositories purportedly indexed by repositories you have already searched. Specific instances of this are called out with some data repositories below, but it is a general problem to be aware of.
  • Learn search syntax differences. Different repositories use different search implementations. Large websites and repositories will usually provide help on their flavor of search, though this help may not be easy to find. Smaller sites and repositories may also have search, but often without help pages or a clear indication of what engine powers the search. Many search implementations are powered by Solr or Elasticsearch, which both use Lucene for querying. Government repositories usually use CKAN. Google Search's site: operator sometimes provides better search results than a website's own search implementation.
  • Watch out for spam, junk, and phantom data. Some data repositories have fallen victim to spam and content hijacking, even if some of the content is still valuable. Other repositories index data that is not available or no longer exists. There is also the ever-present issue of data quality, and often no information on how a dataset was collected. Consider the quality of the data you find before ending your search.

Google Sources

Google Dataset Search

Web address: https://toolbox.google.com/datasetsearch License(s): Varies, often displayed with search results.

Google Dataset Search is the recommended starting point for any data search. Many popular public data repositories are indexed, including many of the sources listed in this document and almost all government data.

Dataset Search supports the site: operator, for example: https://toolbox.google.com/datasetsearch/search?query=site%3Adata.gov%20purchase%20orders&docid=KGlM37oh1x%2Bo4yYsAAAAAA%3D%3D.

  • Limitations of Google Dataset Search. Google Dataset Search indexes datasets based on what structured information webmasters add to their websites. Not all data repositories provide this structured information, and some data repositories provide only part of their searchable metadata to Dataset Search. There can also be a delay before Dataset Search indexes new data posted to indexed repositories. If you cannot find a dataset on Google Dataset Search, it is worth checking appropriate repositories directly, even if they are indexed by Google Dataset Search.

Google Search

Web address: https://www.google.com/ License(s): Varies, often inconclusive.

Google searches with keywords like "data" and "dataset" will sometimes lead you to the data you need. Google Search has additional features that enable powerful dataset searches:

  • The filetype: search operator allows for easy filtering of .csv files and other file types. Example.
  • The site: search operator limits searches to a specific site.

    • Google Search may provide a useful index of a site that holds datasets or links to datasets. Example.
    • It may also be useful for finding license and usage information for web domains that hold datasets. Example.

Kaggle Datasets

Web address: https://www.kaggle.com/datasets License(s): Varies but displayed with results; ability to filter searches by license.

Search 10,000+ datasets uploaded by Kaggle users or maintained by Kaggle. Many of the datasets are well-described, and some datasets also have example Jupyter notebooks ("Kernels"). Kaggle's search interface supports filtering by dataset attributes, including license, and summarizes those attributes alongside search results.

Kaggle offers an API for easy downloading of datasets.
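For example, here is a minimal sketch of downloading a dataset with the official kaggle Python package; it assumes your API token is configured in ~/.kaggle/kaggle.json, and the dataset slug is just a placeholder.

    # Minimal sketch: downloading a Kaggle dataset with the kaggle package
    # (pip install kaggle). Assumes an API token in ~/.kaggle/kaggle.json;
    # the dataset slug below is a placeholder.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files("some-user/some-dataset", path="data/", unzip=True)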

  • Most Kaggle competition data is likely not suitable for use. Kaggle competition data almost always has a very restrictive license forbidding any non-competition usage. As always, review a dataset's license terms before using it.

Google Cloud Marketplace Datasets

Web address: https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset License(s): Varies, stated with Terms of Service for each dataset.

Over 100 datasets are available to all Google Cloud Platform users as BigQuery tables. The quality of these datasets is very high, and many datasets are updated continuously.

These are some of the highest quality public datasets available. It is worth scanning the catalog to see what's available, since many common needs and use cases are supported by these datasets (weather, genomic, census, blockchain, patents, current events, sports, images, comments/forums, macroeconomic, and clickstream analytics, among other things).
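As a rough sketch of how these datasets are used, the query below reads one of the public BigQuery tables with the google-cloud-bigquery client; it assumes a GCP project with billing enabled and application default credentials configured.

    # Minimal sketch: querying a public BigQuery dataset.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application default credentials
    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.name, row.total)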

Google AI Datasets

Web address: https://ai.google/tools/datasets/ License(s): Varies.

Google regularly releases AI datasets to the public. Many of the datasets are existing data annotated with labels (either programmatically or manually), and most are unstructured data (text, video, images, sound).

Google Scholar

Google Scholar is discussed below under Academic Literature Search.

Government Data

Governments are a popular source of public data, and government data is often shared with generous license terms. But some government data is subject to restrictive licenses. You should not assume that all government datasets are suitable for use.

  • Using CKAN. Most government data repositories are built on CKAN, which has its own search syntax. Sites that index datasets in government data repositories may not capture all the metadata in CKAN, making local searches on government data repositories worthwhile (though a CKAN site indexed by another CKAN site is more likely to be well-indexed).

One unfortunate drawback of government open data sites is that many of the dataset results are redacted, and many "datasets" are hard-to-use file types like PDFs. One easy way to restrict the search on most CKAN sites is to explicitly include a file extension in your search (compare to the results without an extension).
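If you prefer to search programmatically, here is a minimal sketch against the standard CKAN action API, using catalog.data.gov as an example portal; the query and filters are illustrative and assume the portal exposes the standard package_search action.

    # Minimal sketch: searching a CKAN-based portal through package_search.
    # The fq filter restricts results to CSV resources.
    import requests

    resp = requests.get(
        "https://catalog.data.gov/api/3/action/package_search",
        params={"q": "purchase orders", "fq": "res_format:CSV", "rows": 5},
    )
    for result in resp.json()["result"]["results"]:
        print(result["title"])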

United States (data.gov)

Web address: https://www.data.gov/ License(s): Varies, most search records include details.

Hundreds of thousands of datasets are indexed on data.gov. Most of the data is from the U.S. Federal Government, but datasets from other levels of U.S. government and from non-U.S. governments are also partially indexed. Ad-hoc searches show data.gov does a good job of indexing U.S. state and U.S. municipal open data sites.

European Union

Web address: https://www.europeandataportal.eu License(s): Varies, many search records include details.

This site indexes close to 1 million datasets, mainly data produced by the E.U. and EU member states (and, for now, the U.K.). 100,000+ datasets are in English, though some non-English datasets do have English search metadata. SPARQL querying is supported for searching linked data.

Other Government Data

There are many data repositories hosting open government data. For the very thorough (or when the U.S. Federal Government is shut down), http://datacatalogs.org/search has a list of 500+ government open data repositories, and OpenDataSoft created a list of 2500+ open data repositories. Both of these links offer diminishing returns, and many of these long tail repositories are indexed by Google Dataset Search.

  • Other government data repositories with 10,000+ English datasets

Freedom of Information Act Requests

The Freedom of Information Act (FOIA) is a law giving full or partial access to previously unreleased data controlled by the U.S. Federal Government. Similar laws may exist at the U.S. state level, and in other countries. If you think a government agency may have data of interest to you, consider filing a FOIA or similar request. At least one very popular open dataset resulted from a FOIA request.

Making a FOIA or similar request is not always straightforward, though there is help available on the web. In the U.S., most agencies now have specific FOIA procedures that have to be followed (these procedures are usually documented on an agency's website), and the requestor has to pay the cost of obtaining the information, though the cost is usually minimal.

While a FOIA or similar request may seem like a lot to go through for a dataset, the costs are relatively low if the request is well-targeted and turnaround times are not unreasonable (the FOIA specifies 20 days).

Other Data Repositories

Non-government organizations that host their own data.

UC Irvine Machine Learning Repository

Web address: https://archive.ics.uci.edu/ml/ License(s): Varies, may not be explicitly stated.

The UC Irvine Machine Learning Repository is one of the oldest and most popular data repositories. It indexes around 500 datasets and is heavily used in computer science research; by citation count, the repository would rank among the top 100 most cited CS research papers of all time.

Registry of Open Data on AWS

Web address: https://registry.opendata.aws/ License(s): Varies, but usually listed in a dataset's metadata.

Around 100 large public datasets are hosted on AWS, usually in S3 buckets. While the selection is small, the quality of the data is high and most of the datasets are voluminous and unstructured--documents, text, images, etc.

A few of the datasets are also available in Google Cloud Storage buckets.

CRAN (The Comprehensive R Archive Network)

Web address: https://cran.r-project.org/index.html License(s): Varies, but clearly stated with each R package.

CRAN is the centralized repository of R packages; it is not technically a data repository. But the 10,000+ packages available through CRAN contain over 25,000 .rda and .Rdata files, and many are real, high-quality (mostly structured) datasets. Further, the contents of those datasets are documented in line with CRAN's very high documentation standards.

OpenML: Open Machine Learning

Web address: https://www.openml.org/search?type=data License(s): Varies, but clearly stated with most datasets.

OpenML.org is an online platform for sharing machine learning data and experiments. Over 20,000 datasets are available on the platform, usually with indicated licenses, and many of the datasets have been reviewed by an administrator for quality.

Find Datasets | CMU Libraries

Web address: https://guides.library.cmu.edu/machine-learning/datasets

A collection of high-quality datasets curated by Huajin Wang at CMU Libraries.

Microsoft Research Open Data

Web address: https://msropendata.com/

A collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences. Datasets can be downloaded or copied directly to a cloud-based Data Science Virtual Machine for a seamless development experience.

Dataset Search Tools

Enigma Public

Web address: https://public.enigma.com/ License(s): CC BY-NC 4.0 (non-commercial)

Enigma is a data broker and services provider that maintains a repository of public data. The sources of the data appear to be FOIA requests, open government data, and scraped websites. There is a large quantity and variety of data, and the interface is well-designed.

  • Limitations on Commercial Use. The data in Enigma is under a license that forbids commercial use. However, the site sometimes links to the source of the data, where the data may be available under different license terms. Many datasets indexed by Enigma Public are also available elsewhere on the web, or are FOIA request data that can presumably be re-requested.

Academic Torrents

Web address: http://academictorrents.com/ License(s): Varies, but usually listed in a dataset's metadata.

Close to 500 datasets are indexed by Academic Torrents. The majority of the datasets are unstructured (video, images, sound, text), and many are exceptionally large, including some terabyte-scale image and video datasets.

Academic Torrents does not host the data itself, instead providing .torrent files and usually a link to a website or paper with more information about the dataset.

  • Try to download the .torrent files, even if you can find the data elsewhere. Alternative means of download are often available from the home websites of the datasets listed on Academic Torrents, but these large datasets can be expensive to host. Academic Torrents exists specifically to reduce the financial burden on the mostly academic and non-commercial releasers of these large datasets.

Reddit r/datasets

Web address: https://www.reddit.com/r/datasets/ License(s): Varies, based on the linked dataset.

Reddit's r/datasets is a community of about 50,000 individuals dedicated to hunting down datasets. Users post requests for datasets and discuss data finding, usage, and purchase.

Reddit's search functionality allows searching only r/datasets, and many kinds of hard-to-find data have been discussed and shared on r/datasets. You can also start a new discussion to get help finding a specific dataset.

General Search Tools and Repositories

Search engines and information repositories not specifically intended for data provide ways to track down data. Using Google Search to find data is discussed above.

GitHub Code Search

Web address: https://github.com/search?type=Code License(s): Varies.

Many developers store data with code, so a lot of data is stored on GitHub. GitHub code search indexes the contents of GitHub repos, and supports searching files by extension (example). The advanced search also allows filtering by license, which parses the LICENSE file found in many repos, though the license search may not be accurate or comprehensive.
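A minimal sketch of the same kind of search through GitHub's REST code-search endpoint; it assumes a personal access token in the GITHUB_TOKEN environment variable, and the query string is only an example.

    # Minimal sketch: finding CSV files on GitHub that mention a keyword.
    import os
    import requests

    resp = requests.get(
        "https://api.github.com/search/code",
        params={"q": "population extension:csv"},
        headers={
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
    )
    for item in resp.json().get("items", []):
        print(item["repository"]["full_name"], item["path"])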

Internet Archive

Web address: https://archive.org/ License(s): Varies.

The Internet Archive is a "digital library of Internet sites and other cultural artifacts in digital form." It holds a large collection of different types of information, although much of it is disorganized, of unclear origin, and not easily automatically parsed (pdfs, images, audio, and video).

Note that licenses can be difficult to determine. If you cannot find a license, you may still be able to track down the original content owner.

  • Be prepared to label your own data. The Internet Archive is a near-limitless source of raw data that, with some labeling, can form unique, in-demand datasets. The Internet Archive also offers powerful search capabilities, and does a good job tracking the type of information for easy filtering of results by images, text, video, datasets, etc.

Academic Literature Search

Datasets are vital to research in machine learning and other fields, and close to 1,000 journals require data sharing as a condition for publication. Even when not required, authors often release their data (and code), and the tradition of information sharing in academia means that asking for data used in a publication often results in a response.

Google Scholar

Web address: https://scholar.google.com/ License(s): N/A

Google Scholar is the most comprehensive index of academic literature on the web (here's a short overview of how to use Google Scholar). Simple searches can reveal many papers on a specific use case, and searches targeting specific types of data can be productive as well.

  • Don't get discouraged by journal paywalls. The left side of Google Scholar results often links directly to a non-paywalled version of a paper. If a non-paywalled result is not linked directly, searching for the authors/title often gives a result, since authors may have the right to distribute their work on their own website. If you cannot find an article, your organization or a local library may have access rights to the journal or book.

Arxiv

Web address: https://arxiv.org/ License(s): N/A

Arxiv is an archive of academic research (technically preprints) with a focus on the sciences. Because Arxiv is especially strong in computer science and machine learning, searching for types of data and specific machine learning problems can give very relevant results.

Getting Data Mentioned in Academic Publications

Once you've identified an academic publication that may be relevant, you have to track down the dataset.

  • Determine what dataset the publication uses. For most academic research the data is secondary to the discoveries and conclusions of the research. Often the source and structure of the dataset are buried in the journal article. Searches for words like "data", "dataset", "examples", etc. can help, as can looking for sections like "Experiment(s)", "Evaluation", "Methodology", and "Result(s)".

Once you find where the dataset is described, look for citations, web links, or even a dataset name (common when using extremely popular datasets). But sometimes few details are shared (see "Contact the Authors" below).

  • Search the Internet. If the dataset has a citation, or is discussed in a way that suggests it is common to the field of the publication, you should be able to track the dataset down by searching for the dataset name or the cited journal article, website, or publication.

Many journal articles now have code and data online, and Google Search and GitHub code search will sometimes give results. Searching for the name of the paper may also lead to someone who has implemented the paper's method independently, who may have their own dataset.

  • Contact the Authors. Often the data for a publication cannot be found online, but you can always contact the paper authors directly and ask for access to their data.

If authors cannot share data, they may be able to put you in touch with whoever provided them their data. Additionally, authors tend to know the problems they publish on well, and may be able to point you to similar datasets.

If authors do offer to share the data with you, make sure you understand the terms/license for using the data before using it. You may need to ask authors directly for this. Be aware it is very common for datasets used in academic research to forbid commercial use.

Data by Modality

Sources for specific types of data, sometimes unlabeled.

Images

  • Creative Commons Search: Searchable database of images, including the ability to filter by license.
  • ImageNet: Searchable database of images. Downloading raw images is restricted, but features derived from images (like bounding boxes) have fewer restrictions.
  • VisualData: Searchable database of datasets for common computer vision problems, also includes a few video datasets.
  • xView: xView is one of the largest publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.
  • Labelme: A large dataset of annotated images.
  • LSUN: Scene understanding with many ancillary tasks (such as room layout estimation, saliency prediction)
  • MS COCO: Generic image understanding and captioning.
  • COIL100: 100 different objects imaged at every angle in a 360-degree rotation.
  • Visual Genome: Very detailed visual knowledge base with captioning of ~100K images.
  • Google's Open Images: A collection of 9 million URLs to images "that have been annotated with labels spanning over 6,000 categories" under Creative Commons.
  • Stanford Dogs Dataset: Contains 20,580 images and 120 different dog breed categories.
  • Indoor Scene Recognition: A very specific and very useful dataset, as most scene recognition models perform better 'outside'. Contains 67 indoor categories and 15,620 images.

Documents

  • DocumentCloud: Database of documents stored by journalists, who may release the documents as part of their reporting. Note that the terms of service for this data are unique, and a legal opinion is recommended before use.

Text

  • Tracking Progress in Natural Language Processing: Tracks the state-of-the art of different natural language processing tasks on different datasets, usually with links to the datasets.
  • HotpotQA Dataset: Question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
  • Enron Dataset: Email data from the senior management of Enron, organized into folders.
  • Amazon Reviews: Contains around 35 million reviews from Amazon spanning 18 years. Data include product and user information, ratings, and the plaintext review.
  • Google Books Ngrams: A collection of words from Google books.
  • Blogger Corpus: A collection of 681,288 blog posts gathered from blogger.com. Each blog contains a minimum of 200 occurrences of commonly used English words.
  • Wikipedia Links data: The full text of Wikipedia. The dataset contains almost 1.9 billion words from more than 4 million articles. You can search by word, phrase or part of a paragraph itself.
  • Gutenberg eBooks List: Annotated list of ebooks from Project Gutenberg.
  • Hansards text chunks of Canadian Parliament: 1.3 million pairs of texts from the records of the 36th Canadian Parliament.
  • Jeopardy: Archive of more than 200,000 questions from the quiz show Jeopardy.
  • Rotten Tomatoes Reviews: Archive of more than 480,000 critic reviews (fresh or rotten).
  • SMS Spam Collection in English: A dataset that consists of 5,574 English SMS spam messages.
  • Yelp Reviews: An open dataset released by Yelp that contains more than 5 million reviews.
  • UCI's Spambase: A large spam email dataset, useful for spam filtering.
  • Multidomain sentiment analysis dataset: A slightly older dataset that features product reviews from Amazon.
  • IMDB reviews: An older, relatively small dataset for binary sentiment classification, featuring 25,000 movie reviews.
  • Stanford Sentiment Treebank: Standard sentiment dataset with sentiment annotations.
  • Sentiment140: A popular dataset of 1.6 million tweets with emoticons pre-removed.
  • Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, or neutral tweets.

Self-driving (Autonomous Driving) Datasets

  • Berkeley DeepDrive BDD100k: Currently the largest dataset for self-driving AI. Contains over 100,000 videos covering more than 1,100 hours of driving across different times of day and weather conditions. The annotated images come from the New York and San Francisco areas.
  • Baidu Apolloscapes: Large dataset that defines 26 different semantic items such as cars, bicycles, pedestrians, buildings, streetlights, etc.
  • Comma.ai: More than 7 hours of highway driving. Details include the car's speed, acceleration, steering angle, and GPS coordinates.
  • Oxford's Robotic Car: Over 100 repetitions of the same route through Oxford, UK, captured over a period of a year. The dataset captures different combinations of weather, traffic and pedestrians, along with long-term changes such as construction and roadworks.
  • Cityscape Dataset: A large dataset that records urban street scenes in 50 different cities.
  • CSSAD Dataset: This dataset is useful for perception and navigation of autonomous vehicles. The dataset skews heavily on roads found in the developed world.
  • KUL Belgium Traffic Sign Dataset: 10,000+ traffic sign annotations from thousands of physically distinct traffic signs in the Flanders region of Belgium.
  • MIT AGE Lab: A sample of the 1,000+ hours of multi-sensor driving datasets collected at AgeLab.
  • LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets: This dataset includes traffic signs, vehicles detection, traffic lights, and trajectory patterns.
  • Bosch Small Traffic Light Dataset: Dataset for small traffic lights for deep learning.
  • LaRa Traffic Light Recognition: Another dataset for traffic lights. This is taken in Paris.
  • WPI datasets: Datasets for traffic lights, pedestrian and lane detection.

Clinical Datasets

  • MIMIC-III: Openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.

Housing Datasets

  • Boston Housing Dataset: Contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive, and has been used extensively throughout the literature to benchmark algorithms.

Linked Data (RDF)

Manually-Curated Dataset Lists

Many websites list links to popular and useful datasets and data repositories. The ones below are high quality, with relatively few broken links and a good variety of datasets and repositories.

  • Awesome Public Datasets: A topic/domain separated list of a few hundred datasets.
  • Data is Plural is a weekly newsletter highlighting interesting datasets. Hundreds of past datasets are described in detail.
  • 1001 Datasets and Dataset Repositories: A bit disorganized, but also has inline copies of other dataset lists, making it easy to browse a lot of manually curated lists from one place.
  • Open Datasets: About 100 datasets separated by the type of data.

Labeling Data

Unlabeled data is easier to obtain than labeled data, and labeling data yourself is often more straightforward than it sounds. You may not need many labeled examples to make a useful dataset, making self-labeling a reasonable choice when you need one.

Do It Yourself

Many organizations and teams discount the idea of creating their own labels as being too time consuming, too expensive, or even too boring. But for a demo, solution, proof-of-concept, or research project, modest amounts of labeled data may be all that is necessary. If you do not require a large amount of labeled data, and/or your labeling task is unusual, self-labeling may save you time and money versus a labeling service.

A key to successful do-it-yourself labeling is to not overcomplicate the labeling task. Many labeling tasks can be done with just some data munging and a spreadsheet. Focus on getting labels rather than on task piloting and design, tool standup, and reliability measurement.

If you need a labeling tool, try to avoid building your own and consider some free and paid alternatives.

Free Software

  • LabelImg: An open source application for annotating bounding boxes on images.
  • BRAT: A web-based tool for text annotation tasks. A self-hosted version is currently "experimental" and may be suitable for smaller tagging tasks; the non-experimental version requires you to run your own web server.

Paid Software

  • RectLabel: A MacOS app supporting most image annotation tasks. Not free but costs only a few dollars.

Platforms/SaaS

  • Labelbox: A popular platform for image annotation offering a balance of easy setup, a nice interface, and flexible task design. The first 5000 annotations are free.
  • Tagtog: A platform for text annotation and document labeling. It works natively on PDFs and is easy to get started with, but the free tier is only suitable for very small projects.
  • Dataturks: Another platform for image and text annotation. Dataturks offers a free service for up to 10k labels, although your labeled data will be publicly available.

Labeling Services (Human Computation Services)

Labeling services are a popular way of getting labeled data. These services recruit and manage workers to do your labeling tasks.

  • Figure Eight (formerly CrowdFlower): A popular platform-as-a-service for data labeling, enrichment, and cleaning. Figure Eight also integrates with other services (including Google Cloud Platform) and offers professional services to help you set up a project or manage an entire labeling task.
  • iMerit: Provides labor for data labeling, enrichment, and cleaning, and integrates directly with Google Cloud Platform. iMerit maintains a fully-employed workforce, rather than crowdsourcing tagging resources for each task. They can operate with resources dedicated solely to your project, and support enterprise-level needs like NDAs, cleanrooms, government clearances, and the option to use workers onshore or offshore.
  • Google Cloud AutoML Vision Human Labeling Service: AutoML Vision on Google Cloud Platform has a fully-featured integrated service for labeling images.

How Much Labeled Data is Enough?

Ideally, a labeling task labels neither too little nor too much data. Learning curves help determine if additional labels will improve performance of a model. Learning curves also provide reasonable estimates of the performance ceiling of a specific model+dataset.

The general idea behind learning curves is to train your model with progressively larger training sets and plot the performance metrics. This blog post goes into greater detail.
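A minimal sketch of a learning curve with scikit-learn; the model and dataset here are placeholders for whatever you are actually training.

    # Minimal sketch: plot training/validation scores against training set size
    # to judge whether more labeled data is likely to help.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_digits(return_X_y=True)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    plt.plot(sizes, train_scores.mean(axis=1), label="training accuracy")
    plt.plot(sizes, val_scores.mean(axis=1), label="validation accuracy")
    plt.xlabel("Training examples")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()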

Using Machine Learning to Label

If you have some labeled data, you can create more labeled data using machine learning. Be aware that using machine learning to label unlabeled data has a risk: any information or patterns not present in your original labeled data but present in your unlabeled data will go unnoticed. But if you need more labels to show model training on large amounts of data, or if you want to train a model that is particularly data hungry, automatic label generation approaches may be suitable.

  • Use your model or a similar model

If you need additional labeled data to show model training at large scale or to demonstrate a modeling technique that requires a lot of labeled data, consider training a model on the data you have and labeling your unlabeled data with predictions from the model. If the modeling technique you intend to use is discriminative, you can use a similar generative modeling technique to label data and improve the performance of the discriminative model.
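One common way to do this is pseudo-labeling (self-training): keep only the model's most confident predictions on unlabeled data as new labels. A minimal sketch, with the classifier and confidence threshold as placeholders:

    # Minimal sketch of pseudo-labeling: train on the labels you have, then add
    # only high-confidence predictions on unlabeled rows as new training labels.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def pseudo_label(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
        model = RandomForestClassifier(random_state=0)
        model.fit(X_labeled, y_labeled)

        probs = model.predict_proba(X_unlabeled)
        confident = probs.max(axis=1) >= threshold        # keep confident rows
        predicted = model.classes_[probs.argmax(axis=1)[confident]]

        X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
        y_aug = np.concatenate([y_labeled, predicted])
        return X_aug, y_aug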

  • Unsupervised and semi-supervised techniques

Machine learning methods specifically for labeling unlabeled data are an active area of research. Good starting points are scikit-learn's implementations of label propagation and label spreading.
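A minimal sketch of label spreading with scikit-learn; unlabeled points are marked with -1 and receive labels propagated from the labeled points.

    # Minimal sketch: semi-supervised labeling with sklearn's LabelSpreading.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.semi_supervised import LabelSpreading

    X, y = load_iris(return_X_y=True)
    rng = np.random.RandomState(0)

    y_partial = y.copy()
    y_partial[rng.rand(len(y)) < 0.7] = -1    # hide ~70% of the labels

    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X, y_partial)
    print((model.transduction_ == y).mean())  # agreement with the true labels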

Modifying Real Data and Substituting Data

Rarely will you find the exact dataset that you want, but you can almost always find something close to it. Often close is good enough. If the license of a dataset allows it, consider modifying the dataset into the data you want.

A good dataset for modification has similar structure to your ideal dataset and is governed by the same statistical processes. It may not be from the same domain, or have similar labels or field names—these are what you can change.

Also consider using a substitute dataset instead of modifying a dataset. Sometimes the goals of your machine learning work (especially for demos and proofs-of-concept) can be met by a less-than-perfect dataset that still extends to the problem you are discussing.

Examples of Dataset Modification

  • Product sales per week

Your customer is a publisher of religious books, and you are demonstrating a model that predicts weekly sales of their books per store. This dataset of liquor sales in Iowa is probably suitable, but you may want to change the names of some columns, remove other columns, and change some of the categorical labels to be related to books rather than liquor.
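A minimal sketch of that kind of modification with pandas; the file name, column names, and category mapping below are hypothetical, so check the actual schema of the file you download.

    # Minimal sketch: rename/drop columns and relabel categories so the liquor
    # sales data reads like book sales. Column names are hypothetical.
    import pandas as pd

    df = pd.read_csv("iowa_liquor_sales.csv")    # hypothetical local copy

    df = df.rename(columns={
        "store_name": "bookstore_name",
        "category_name": "book_category",
        "sale_dollars": "weekly_sales",
    })
    df = df.drop(columns=["vendor_name", "bottle_volume_ml"], errors="ignore")

    # Map the liquor categories onto illustrative book genres
    genres = ["devotional", "theology", "history", "children", "music"]
    codes = df["book_category"].astype("category").cat.codes
    df["book_category"] = [genres[c % len(genres)] for c in codes]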

  • Automatic crowd counting

For a research project, you want to demonstrate a machine learning model that counts the number of people in a park. The Stanford Drone Dataset has drone footage of various scenes with labeled bounding boxes around pedestrians. You can use this dataset to create a derivative dataset of still images with counts of pedestrians.
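A minimal sketch of deriving per-frame pedestrian counts from bounding-box annotations; the column layout below is an assumption about the annotation files, so verify it against the files you actually download.

    # Minimal sketch: turn bounding-box annotations into crowd counts per frame.
    # The assumed column layout is hypothetical; adjust it to the real files.
    import pandas as pd

    cols = ["track_id", "xmin", "ymin", "xmax", "ymax",
            "frame", "lost", "occluded", "generated", "label"]
    ann = pd.read_csv("annotations.txt", sep=" ", names=cols)

    visible = ann[ann["label"].str.contains("Pedestrian") & (ann["lost"] == 0)]
    counts = visible.groupby("frame")["track_id"].nunique()
    print(counts.head())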

Examples of Substitute Datasets

  • Finding infringing patents

This dataset of pairs of sentences with a 1-5 label of how "semantically related" (same meaning) the two sentences are could be used for other purposes. For example, a modeling pipeline that performs well on this dataset could be the core component of a system that detects patents with a large number of claims that are similar to claims in other patents.

  • Customer service email routing

You want to demonstrate a system that can read an email and predict what product(s) are mentioned in the email. This Stack Overflow dataset, specifically the data of questions with topic/programming language tags, could let you demonstrate that predicting meaningful labels from a block of text is possible.
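A minimal sketch of such a demonstration framed as multi-label text classification; the file name and column names ("title", "tags") are hypothetical placeholders for however you export the Stack Overflow data.

    # Minimal sketch: predict tags from question titles as a stand-in for
    # routing emails by product. File and column names are hypothetical.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    df = pd.read_csv("stackoverflow_questions.csv")
    y = MultiLabelBinarizer().fit_transform(df["tags"].str.split("|"))

    model = make_pipeline(
        TfidfVectorizer(max_features=50000),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )
    model.fit(df["title"], y)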

Additional Resources
