Datasets for Machine Learning

Best public datasets for machine learning, data science, sentiment analysis, computer vision, natural language processing (NLP), clinical data, and others.

Dataset Finders

Google Dataset Search: Similar to how Google Scholar works, Dataset Search lets you find data wherever they are hosted, whether it’s a publisher’s site, a digital library, or an author’s web page. It’s a phenomenal data finder, and it contains over 25 million datasets.
Kaggle: Kaggle hosts a vast collection of datasets, with something for everyone from the enthusiast to the expert.
UCI Machine Learning Repository: The Machine Learning Repository at UCI provides an up-to-date resource for open datasets.
VisualData: Discover computer vision datasets by category, with searchable queries.
CMU Libraries: Discover high-quality data sets thanks to the collection of Huajin Wang, at CMU.
The Big Bad NLP Database: This cool data set list contains data for various natural language processing tasks, created and curated by Quantum Stat.

Machine Learning Datasets

General Datasets

Housing Datasets
Boston Housing: Contains information collected by the US Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms.

Geographic Datasets

Google-Landmarks-v2: An improved dataset for landmark recognition and retrieval. This data set contains 5M+ images of 200k+ landmarks from across the world, sourced and annotated by the Wikimedia Commons community.

Machine Learning Datasets:

Mall Customers: The Mall Customers dataset contains information about people visiting a mall in a particular city. The data set consists of columns such as gender, customer id, age, annual income, and spending score. It’s generally used to segment customers based on their age, income, and interest.
IRIS: The iris dataset is a simple, beginner-friendly data set that records the sepal and petal length and width of iris flowers. The data is divided into three classes, with 50 rows in each class. It’s generally used for classification and regression modeling.
MNIST: This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is a perfect data set to start implementing image classification, where you classify each image as a digit from 0 to 9.
Fake News Detection: A CSV file with 7,796 rows and four columns: news, title, news text, and result.
Wine quality: The data set contains chemical measurements of different wines. It is suitable for classification and regression tasks.
SOCR data - Heights and Weights: This is a basic data set for beginners. It contains only the heights and weights of 25,000 different humans of 18 years of age. This dataset can be used to build a model that predicts the height or weight of a human.
Titanic: The data set contains information like name, age, sex, number of siblings aboard, and other information about 891 passengers in the training set and 418 passengers in the testing set.
Credit Card Fraud Detection: The data set contains transactions made by credit cards; they are labeled as fraudulent or genuine. This is important for companies that have transaction systems to build a model for detecting fraudulent activities.
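As a rough illustration of the customer segmentation that the Mall Customers entry above describes, here is a minimal sketch using only the Python standard library. The sample rows are made up, the column names are assumptions modeled on the fields listed above (not the dataset's actual header), and the thresholds are arbitrary.

```python
import csv
import io
from collections import Counter

# Made-up sample rows. The column names are assumptions modeled on the
# fields listed above, not the real dataset's header.
SAMPLE_CSV = """customer_id,gender,age,annual_income_k,spending_score
1,Male,19,15,39
2,Female,21,15,81
3,Female,35,78,12
4,Male,42,86,90
5,Female,50,54,47
"""


def segment(income_k, score):
    """Assign a coarse segment label; both thresholds (50) are arbitrary."""
    income = "high-income" if income_k >= 50 else "low-income"
    spending = "high-spender" if score >= 50 else "low-spender"
    return f"{income}/{spending}"


def segment_customers(csv_text):
    """Map each customer id to its segment label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["customer_id"]: segment(
            float(row["annual_income_k"]), float(row["spending_score"])
        )
        for row in reader
    }


segments = segment_customers(SAMPLE_CSV)
counts = Counter(segments.values())
print(segments)
print(counts)
```

A rule-based split like this is only a starting point. With the real data, a clustering method such as k-means is a common next step.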


Computer Vision Datasets

xView: xView is one of the most massive publicly available data sets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.
ImageNet: One of the largest image data sets for computer vision. It provides an accessible image database that is organized hierarchically, according to WordNet.
Kinetics-700: A large-scale data set of video URLs from YouTube, covering human-centered actions. It contains over 700,000 videos.
Google’s Open Images: A vast data set from Google AI containing over 10 million images.
Cityscapes: This is an open-source data set for computer vision projects. It contains high-quality pixel-level annotations of video sequences taken in the streets of 50 different cities. The data is useful for semantic segmentation and for training deep neural networks to understand urban scenes.
IMDB-Wiki: The IMDB-Wiki data set is one of the most extensive open-source collections of face images labeled with gender and age. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images.
Color Detection: The data set is a CSV file that lists 865 color names with their corresponding RGB (red, green, and blue) values. It also gives the hexadecimal value of each color.
Stanford Dogs: It contains 20,580 images and 120 different dog breed categories.
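To make the Color Detection entry above more concrete, the sketch below shows how a table of names and RGB values might be used: converting RGB to a hexadecimal string and finding the closest named color by Euclidean distance. The small in-code table is a stand-in for the dataset's rows and is an assumption, as are the function names.

```python
import math

# A handful of well-known colors stands in for the dataset's rows.
# This table and its layout are assumptions, not the dataset's contents.
NAMED_COLORS = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def rgb_to_hex(rgb):
    """Format an (r, g, b) tuple of 0-255 integers as a '#rrggbb' string."""
    r, g, b = rgb
    return f"#{r:02x}{g:02x}{b:02x}"


def nearest_color(rgb):
    """Return the name whose RGB value is closest in Euclidean distance."""
    return min(NAMED_COLORS, key=lambda name: math.dist(NAMED_COLORS[name], rgb))


print(rgb_to_hex((255, 0, 0)))
print(nearest_color((250, 10, 5)))
print(nearest_color((10, 10, 200)))
```

Distance in raw RGB space does not match human color perception very well, so treat this as a simple baseline rather than a recommended method.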

Sentiment Analysis Datasets

Lexicoder Sentiment Dictionary: This data set is specific to sentiment analysis. It contains over 3,000 negative words and over 2,000 positive sentiment words.
IMDB reviews: An interesting data set with over 50,000 movie reviews from Kaggle.
Stanford Sentiment Treebank: Standard sentiment data set with sentiment annotations.
Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets.
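Sentiment dictionaries like the one listed above are typically used by counting positive and negative words in a text. The following is a minimal sketch of that idea. The two tiny word sets are hypothetical placeholders (not entries taken from any real lexicon), and the three output labels mirror the positive, negative, and neutral classes mentioned above.

```python
import re

# Hypothetical placeholder word lists; a real lexicon would be far larger.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "delayed"}


def lexicon_score(text):
    """Return (# positive words) - (# negative words) in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return pos - neg


def label(text):
    """Map the lexicon score to a positive / negative / neutral label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("Great crew and I love the new seats"))
print(label("My flight was delayed and the food was awful"))
print(label("Flight lands at noon"))
```

Plain word counting ignores negation and sarcasm, which is one reason labeled data sets such as the ones above are used to train learned classifiers instead.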

Natural Language Processing (NLP) Datasets

The Big Bad NLP Database: This cool dataset list contains datasets for various natural language processing tasks, created and curated by Quantum Stat.
HotpotQA: Question answering data set featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
Amazon Reviews: A vast data set from Amazon, containing over 45 million Amazon reviews.
Rotten Tomatoes Reviews: Archive of more than 480,000 critic reviews (fresh or rotten).
SMS Spam Collection in English: A data set that consists of 5,574 English SMS messages, each labeled as spam or legitimate (ham).
Enron Email: It contains around 500,000 emails from over 150 users.
Recommender Systems: A collection of data sets from popular websites, including Goodreads book reviews, Amazon product reviews, bartending data, and social media data, used for building recommender systems.
UCI Spambase: Classifying emails as spam or non-spam is a common and useful task. The dataset contains 4,601 emails described by 57 attributes. You can use it to build models that filter out spam.
IMDB reviews: The Large Movie Review Dataset consists of movie reviews from the IMDB website, with 25,000 reviews for training and 25,000 for testing.

Self-driving (Autonomous Driving) Datasets

Waymo Open: This is a fantastic dataset resource from the folks at Waymo. It includes a vast dataset of autonomous driving, enough to train deep nets from scratch.
Berkeley DeepDrive BDD100k: One of the largest data sets for self-driving cars, containing over 2000 hours of driving experiences across New York and California.
Bosch Small Traffic Light: Dataset for small traffic lights for deep learning.
LaRa Traffic Light Recognition: Another data set for traffic lights. This data set is gathered from Paris.
WPI datasets: Data sets for traffic lights, pedestrian, and lane detection. It contains details such as a car’s speed, acceleration, steering angle, and GPS coordinates.
MIT AgeLab: A sample of the 1,000+ hours of multi-sensor driving data sets collected at AgeLab.
LISA (Laboratory for Intelligent & Safe Automobiles, UC San Diego) Datasets: This data set includes traffic signs, vehicle detection, traffic lights, and trajectory patterns.
Cityscapes: This is an extensive dataset of street scenes from 50 different cities.

Clinical Datasets

MaskedFace-Net: MaskedFace-Net is a dataset of human faces with correctly and incorrectly worn masks. It contains over 137k images, which are based on the Flickr-Faces-HQ dataset [21]. For more details about the data set and its uses, please visit the documentation on GitHub.
COVID-19: The Allen Institute for AI has released a vast research dataset of over 45,000 scholarly articles about COVID-19.
MIMIC-III: Openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.

Datasets for Recommender Systems

MovieLens: It contains rating data sets from the MovieLens website.
Jester: It contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users. It’s mostly used for collaborative filtering.
Million Song: It can be used for both collaborative and content-based filtering.
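Several entries above mention collaborative filtering. As a minimal sketch of the user-based variant, the code below predicts a missing rating as a similarity-weighted average of other users' ratings, using cosine similarity over shared items. The users, items, and ratings are made up (on a -10 to +10 scale like the one described for Jester), and the function names are illustrative only.

```python
import math

# Made-up ratings on a -10.0 to +10.0 scale; users and jokes are hypothetical.
RATINGS = {
    "ana": {"joke1": 8.0, "joke2": -4.0, "joke3": 6.0},
    "ben": {"joke1": 7.0, "joke2": -5.0, "joke3": 5.0, "joke4": 9.0},
    "cara": {"joke1": -6.0, "joke2": 8.0, "joke4": -7.0},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, ratings in RATINGS.items():
        if other == user or item not in ratings:
            continue
        sim = cosine_similarity(RATINGS[user], ratings)
        weighted_sum += sim * ratings[item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None


print(predict("ana", "joke4"))
```

Because the weights can be negative, a user with opposite tastes who disliked an item pushes the prediction up. Production recommenders usually add mean-centering and neighborhood size limits, which are omitted here for brevity.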
DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings are not intended to be final products, but rather a reflection of current thinking and a catalyst for discussion and improvement.

In this post, we covered good places to find datasets for any type of data science project. We hope that you find something interesting that you want to sink your teeth into!

