70+ Machine Learning Datasets & Project Ideas – Work on Real-Time Data Science Projects

Finding the right data set while researching for machine learning or data science projects is a daunting task. To build accurate models, you also need a large amount of data. But don’t worry: many researchers, organizations, and individuals have shared their work, and we can use their data sets in our projects. In this article, we will discuss more than 70 machine learning data sets that you can use to build your next data science project.

Machine learning data sets

Here are some data sets you may use while working on any data science or machine learning project:

Machine learning data sets for data science beginners


1. Store Customer Data Set

In today’s data-driven world, customer data sets have become invaluable assets for businesses, particularly for those operating in the retail industry. These data sets provide valuable insights into customer behavior, preferences, and purchasing patterns. This section looks at the significance of store customer data sets, their importance for businesses, and the applications they offer.

  1. Understanding Customer Behavior: Store customer data sets offer businesses a deep understanding of customer behavior and buying patterns. By analyzing data on customer demographics, browsing habits, purchase history, and preferences, businesses can identify trends and patterns that can inform marketing strategies, product development, and customer experience enhancements. This understanding enables businesses to tailor their offerings to meet customer needs and provide personalized experiences, fostering customer loyalty and satisfaction.
  2. Targeted Marketing and Advertising: Customer data sets empower businesses to implement targeted marketing and advertising campaigns. By segmenting customers based on various criteria, such as age, gender, location, or purchasing history, businesses can create tailored promotional materials, offers, and recommendations. This targeted approach increases the effectiveness of marketing efforts, resulting in higher conversion rates, improved customer engagement, and enhanced return on investment.
  3. Product Development and Innovation: Store customer data sets serve as a valuable resource for product development and innovation. By analyzing customer feedback, reviews, and preferences, businesses can gain insights into areas for improvement or identify new product opportunities. Understanding customer demands and preferences helps businesses stay ahead of the competition by delivering products that align with customer expectations and market trends.
  4. Customer Retention and Loyalty Programs: Customer data sets play a crucial role in customer retention and loyalty programs. By tracking customer purchase history, businesses can identify loyal customers and develop targeted loyalty programs to reward and incentivize repeat purchases. Personalized offers, discounts, and rewards based on customer preferences enhance customer loyalty, increase customer lifetime value, and promote long-term relationships with the brand.
  5. Operational Efficiency and Inventory Management: Store customer data sets aid in operational efficiency and inventory management. By analyzing customer buying patterns and demand forecasts, businesses can optimize their inventory levels, ensure stock availability for popular items, and avoid overstocking or understocking. This data-driven approach minimizes inventory costs, reduces waste, and improves overall operational efficiency.
  6. Customer Service and Support: Customer data sets provide valuable insights for delivering exceptional customer service and support. By analyzing customer interactions, feedback, and support tickets, businesses can identify areas of improvement, address customer concerns proactively, and provide timely and personalized support. This leads to higher customer satisfaction, increased trust in the brand, and positive word-of-mouth recommendations.
  7. Data-Driven Decision-Making: Store customer data sets enable data-driven decision-making, allowing businesses to make informed choices based on factual insights rather than assumptions. By analyzing customer data, businesses can identify trends, assess the success of marketing campaigns, and optimize various aspects of their operations. Data-driven decision-making promotes agility, efficiency, and strategic growth for businesses.

The Mall Customers data set contains information about people visiting a mall. It includes gender, customer ID, age, annual income, and spending score. You can gain insights from the data and group customers based on their behavior.

1.1 Data Link: Mall customers data set

1.2 Data Science Project Idea: Segment customers by age, gender, and interest. Customer segmentation is the practice of dividing a customer base into groups of similar individuals. It is useful for customized marketing.

1.3 Source Code: Customer Segmentation Project
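
As a rough illustration of customer segmentation, the sketch below groups customers with plain k-means, written with the Python standard library only. The (income, spending score) pairs are made up for the example and are not taken from the mall data, and k-means is one possible choice of clustering method rather than the one the linked project necessarily uses.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Group 2-D points into k clusters with plain k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Made-up (annual income in k$, spending score) pairs, not real mall data.
customers = [(15, 80), (16, 77), (18, 82), (85, 15), (88, 20), (90, 12)]
centroids, clusters = kmeans(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

On real data you would usually scale the columns first, because income and spending score are measured in different units.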

2. Iris Dataset

The iris data set is a simple, beginner-friendly data set that contains the petal and sepal sizes of flowers. It has 3 classes with 50 instances each, so it contains only 150 rows and 4 columns.

2.1 Data Link: Iris data set

2.2 Data Science Project Idea: Implement a machine learning classification or regression model on the data. Classification is the task of assigning items to their corresponding class.

3. MNIST data set

This is a database of handwritten digits. It contains 60,000 training images and 10,000 test images. It is a good first data set for image classification, where you classify the digits 0 to 9.

3.1 Data Link: MNIST data set

3.2 Data Science Project Idea: Implement a machine learning classification algorithm on images to recognize handwritten digits written on paper.

3.3 Source Code: Handwritten Digit Recognition with Deep Learning

4. Boston Housing Dataset

This is a popular data set used in pattern recognition. It contains information about various houses in Boston, based on crime rate, tax, number of rooms, etc. It has 506 rows and 14 variables in columns. You can use this data set to predict house prices.

4.1 Data Link: Boston Housing data set

4.2 Data Science Project Idea: Predict the prices of new houses using linear regression. Linear regression is used to predict values for unseen inputs when the data has a linear relationship between the input and output variables.
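
To make the idea of a linear relationship concrete, here is a minimal sketch of least-squares linear regression with a single input variable. The room counts and prices are invented so that the fit is easy to check by hand; they do not come from the Boston data, and a real project would use all of the input variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative numbers only (rooms, price in $1000s); not Boston housing data.
rooms = [4, 5, 6, 7, 8]
prices = [14, 19, 24, 29, 34]  # constructed so that price = 5 * rooms - 6

slope, intercept = fit_line(rooms, prices)
predicted = slope * 6.5 + intercept  # price estimate for an unseen input
print(slope, intercept, predicted)
```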

5. Fake News Detection Data Set

This is a CSV file with 7796 rows and 4 columns. The first column identifies the news item, the second holds the title, the third holds the news text, and the fourth holds the label TRUE or FAKE.

5.1 Data Link: Fake news detection data set

5.2 Data Science Project Idea: Build a fake news detection model with the Passive Aggressive Classifier algorithm. The Passive Aggressive algorithm can classify massive streams of data and is quick to implement.

5.3 Source Code: Fake News Detection Python Project
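
The sketch below shows the core update of a Passive Aggressive classifier (the PA-I variant) on bag-of-words features, using only the standard library. The headlines are invented for illustration, the function names are hypothetical, and a real project would more likely use TF-IDF features with a library implementation.

```python
from collections import Counter, defaultdict

def featurize(text):
    """Bag-of-words counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

def train_passive_aggressive(examples, epochs=5, C=1.0):
    """PA-I: stay passive when the margin is met, otherwise correct the weights."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:  # label is +1 (TRUE) or -1 (FAKE)
            x = featurize(text)
            score = sum(weights[w] * c for w, c in x.items())
            loss = max(0.0, 1.0 - label * score)  # hinge loss
            if loss > 0.0 and x:
                norm_sq = sum(c * c for c in x.values())
                tau = min(C, loss / norm_sq)  # step size, capped by C
                for w, c in x.items():
                    weights[w] += tau * label * c
    return weights

def predict(weights, text):
    score = sum(weights.get(w, 0.0) * c for w, c in featurize(text).items())
    return "TRUE" if score >= 0 else "FAKE"

# Made-up headlines for illustration; not rows from the data set.
examples = [
    ("scientists publish peer reviewed study", 1),
    ("government releases official budget report", 1),
    ("shocking miracle cure doctors hate", -1),
    ("you will not believe this shocking secret", -1),
]
weights = train_passive_aggressive(examples)
print(predict(weights, "official study report"))
print(predict(weights, "shocking miracle secret"))
```

Because the weights are updated one example at a time, the same loop can keep learning as new articles arrive.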

6. Wine quality data set

The data set contains various chemical measurements of wines. It has 4898 instances with 14 variables each. The data set is suitable for classification and regression tasks. A model can be used to predict the quality of a wine.

6.1 Data Link: Wine quality data set

6.2 Data Science Project Idea: Implement various machine learning algorithms such as regression, decision trees, random forests, etc., then compare the models and analyze their performance.

7. SOCR Data – Height and Weight Data Set

This is a simple data set to start with. It contains only the height (inches) and weight (pounds) of 25,000 different people aged 18 years. This data set can be used to build a model that predicts a person’s height or weight.

7.1 Data Link: Height and weight data set

7.2 Data Science Project Idea: Build a predictive model for determining a person’s height or weight. Implement a linear regression model to predict height or weight.

8. Parkinson Dataset

Parkinson’s is a neurological disorder that affects movement. The data set contains 195 records of people, each with 23 features of biomedical measurements. The data is used to distinguish healthy people from people with Parkinson’s disease.

8.1 Data Link: Parkinson’s data set

8.2 Data Science Project Idea: The model can be used to distinguish healthy people from people with Parkinson’s disease. A useful algorithm for this purpose is XGBoost, which stands for extreme gradient boosting and is based on decision trees.

8.3 Source Code: Parkinson’s Diagnostic Machine Learning Project

9. Titanic data set

On April 15, 1912, the supposedly unsinkable Titanic sank, killing 1502 of the 2224 passengers and crew. The data set contains information such as name, age, sex, number of siblings aboard, etc. for 891 passengers in the training set and 418 passengers in the test set.

9.1 Data Link: Titanic data set

9.2 Data Science Project Idea: Build a fun model to predict whether a person would have survived the Titanic or not. You can use a regression model for this purpose; logistic regression suits the yes/no outcome.

10. Uber Pickups data set

The data set contains 4.5 million Uber pickups in New York City from April 2014 to September 2014 and a further 14 million from January 2015 to June 2015. You can analyze the data and gather insights from it.

10.1 Data Link: Uber pickups data set

10.2 Data Science Project Idea: Analyze the customer ride data and visualize it to find insights that can help improve the business. Data analysis and visualization are an important part of data science. They are used to gather insights from data, and a visualization lets you take in that information quickly.

10.3 Source Code: Uber Data Analysis Project in R

11. Chars74k Dataset

The data set contains images of characters used in the English and Kannada languages. It has 64 classes (0-9, A-Z, a-z), 7.7k characters from natural images, 3.4k hand-drawn characters, and 62k characters synthesized from computer fonts.

11.1 Data Link: Chars 74k dataset

11.2 Data Science Project Idea: Implement character recognition in natural languages. Character recognition is the process of automatically identifying characters in handwritten papers or printed texts.

12. Credit Card Fraud Detection Data Set

The data set contains credit card transactions, labeled as fraudulent or genuine. This is important for companies with transaction systems, which need a model for detecting fraudulent activities.

12.1 Data Link: Credit card fraud detection data set

12.2 Data Science Project Idea: Implement different algorithms such as decision trees, logistic regression, and artificial neural networks to see which one offers the best accuracy. Compare the results of each algorithm and understand the behavior of the models.

12.3 Source Code: Credit Card Fraud Detection Project

13. Chatbot Intents Data Set

The data set is a JSON file containing various tags such as greeting, goodbye, hospital_search, pharmacy_search, etc. Each tag contains a list of patterns that a user might ask and the responses the chatbot can give for that pattern. The data set is a good way to understand how chatbot data works.

13.1 Data Link: Intents JSON Dataset

13.2 Data Science Project Idea: Modify and expand the data with your own observations to build a chatbot and understand how chatbots work in organizations. Building a chatbot requires you to understand the concepts of natural language processing.

13.3 Source Code: Chatbot Project in Python
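
To show how tags, patterns, and responses fit together, the sketch below loads a small intents structure and answers by simple word matching. The key names ("intents", "tag", "patterns", "responses") and the sample entries are assumptions made for the example, and the keyword matcher is only a stand-in for the trained model a real chatbot would use.

```python
import json
import random
import re

# Hypothetical content mirroring the tag / patterns / responses layout.
INTENTS_JSON = """
{
  "intents": [
    {"tag": "greeting",
     "patterns": ["hi", "hello", "good morning"],
     "responses": ["Hello!", "Hi there, how can I help?"]},
    {"tag": "goodbye",
     "patterns": ["bye", "see you later"],
     "responses": ["Goodbye!", "Talk to you soon."]}
  ]
}
"""

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def find_intent(intents, message):
    """Return the first intent with a pattern whose words all occur in the message."""
    words = tokenize(message)
    for intent in intents:
        for pattern in intent["patterns"]:
            if tokenize(pattern) <= words:
                return intent
    return None

def reply(intents, message, rng=random):
    intent = find_intent(intents, message)
    if intent is None:
        return "Sorry, I did not understand that."
    return rng.choice(intent["responses"])

intents = json.loads(INTENTS_JSON)["intents"]
print(reply(intents, "Hello, anyone there?"))
print(reply(intents, "ok, see you later!"))
```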

14. AI Generated Faces Data Set

This contains a variety of practical and real-life machine learning data sets.

14.1 Data Link: AI Generated Faces data set

Machine learning data sets for natural language processing


1. Enron Email Dataset

The Enron data set is popular in natural language processing. It contains around 0.5 million emails from over 150 users, most of whom were senior managers at Enron. The size of the data is around 432 MB.

1.1 Data Link: Enron email data set

1.2 Machine Learning Project Idea: Use k-means clustering to build a model for detecting fraudulent activities. K-means clustering is a popular unsupervised learning algorithm. It partitions the observations into k clusters by looking for similar patterns in the data.

2. Yelp Dataset

Yelp makes its data set publicly available, but you must fill in a form first to access the data. It contains 1.2 million tips by 1.6 million users, more than 1.2 million business attributes, and photos for natural language processing tasks.

2.1 Data Link: Yelp data set

2.2 Machine Learning Project Idea: You can build a model that detects whether a restaurant review is fake or real. By processing the text and the additional features in the data set, you can build an SVM model that classifies reviews as fake or real.

3. Jeopardy Data Set

Jeopardy! is an American television quiz show in which general knowledge questions are asked. The data set contains 200k+ questions and answers in a CSV or JSON file.

3.1 Data Link: Jeopardy data set

3.2 Machine Learning Project Idea: Build a question answering system and implement it in a bot that can play the game of Jeopardy with users. The bot can be used on any platform such as Telegram, Discord, Reddit, etc.

4. Recommender Systems Data Set

This is a portal to a rich collection of data sets used in lab research projects at UCSD. It contains various data sets from popular websites, such as Goodreads book reviews, Amazon product reviews, bartending data, data from social media platforms, etc., that are used in building recommender systems.

4.1 Data Link: Recommender systems data set

4.2 Machine Learning Project Idea: Build a product recommendation system like Amazon’s. A recommendation system can suggest products, movies, etc. based on your interests and preferences.

4.3 Source Code: Movie Recommendation System Project in R

5. UCI Spambase Dataset

Classifying emails as spam or non-spam is a very common and useful task. The data set contains 4601 emails and 57 attributes describing them. You can build models to filter out spam.

5.1 Data Link: UCI spambase dataset

5.2 Machine Learning Project Idea: You can build a model that classifies your emails as spam or non-spam.

6. Flickr 30k data set

The Flickr 30k database is similar to the Flickr 8k data set and contains many labeled images. This contains over 30,000 images and captions. This database is used to build more accurate models than the Flickr 8k data set.

6.1 Data link: Flickr image data set

6.2 Machine Learning Project Idea: Use the same model as for Flickr 8k and make it more accurate with the additional training data. A CNN model is very good at extracting features from an image; the features are then fed into a recurrent neural network that generates the captions.

7. IMDB Reviews Data Set

The Large Movie Review data set contains movie reviews from the IMDB website, with 25,000 reviews for training and 25,000 for testing.

7.1 Data Link: IMDB reviews data set

7.2 Machine Learning Project Idea: Perform sentiment analysis on the data to see statistics about which kinds of movies users like. Sentiment analysis is the process of analyzing text data and identifying the user’s sentiment, positive or negative.

7.3 Source Code: Sentiment Analysis Data Science Project

8. MS COCO data set

Microsoft COCO is a large-scale data set for object detection, segmentation, and image captioning tasks. It contains about 1.5 million labeled object instances. The data set is well suited to building production-ready models.

8.1 Data Link: MS COCO data set

8.2 Machine Learning Project Idea: Detect the objects in an image and generate captions for them. An LSTM (Long Short-Term Memory) network is responsible for generating sentences in English, and a CNN is used to extract features from the image. To build a caption generator, we must combine these two models.

9. Flickr 8k data set

The Flickr 8k data set contains 8000 images, each labeled with 5 different captions. The data set is used to build an image caption generator.

9.1 Data Link: Flickr 8k data set

9.2 Machine Learning Project Idea: Build an image caption generator using a CNN-RNN model. An image caption generator model is able to analyze the features of an image and generate an English sentence that describes it.

9.3 Source Code: Python Image Caption Project

Machine learning data sets for computer vision and image processing


1. CIFAR-10 and CIFAR-100 data set

These are two data sets. The CIFAR-10 data set contains 60,000 tiny images of 32 × 32 pixels. They are labeled from 0 to 9, and each number represents a class. CIFAR-100 is similar to CIFAR-10; the difference is that it has 100 classes instead of 10. These data sets are well suited to image classification.

1.1 Data Link: CIFAR data set

1.2 Artificial Intelligence Project Idea: Perform image classification on different objects and build a model. In image classification, we take an image as input, and the goal is to determine which class the image belongs to.

2. GTSRB (German Traffic Sign Recognition Benchmark) Data Set

The GTSRB data set contains around 50,000 images of traffic signs belonging to 43 different classes, and it includes bounding box information for each sign. The data set is used for multiclass classification.

2.1 Data Link: GTSRB data set

2.2 Artificial Intelligence Project Idea: Build a model using a deep learning framework that classifies traffic signs and also recognizes the bounding box of each sign. Traffic sign classification also helps autonomous vehicles recognize signs and take appropriate action.

2.3 Source Code: Traffic Signs Recognition Python Project

3. ImageNet data set

ImageNet is a large image database organized according to the WordNet hierarchy. It has more than 100,000 phrases and an average of 1000 images per phrase. Its size exceeds 150 GB. It is suitable for image recognition, face recognition, object detection, etc. It also hosts a challenging competition called ILSVRC that encourages people to build more and more accurate models.

3.1 Data Link: ImageNet data set

3.2 Artificial Intelligence Project Idea: Implement image classification on this huge database and recognize objects. A CNN (convolutional neural network) model is needed for this project to get accurate results.

4. Data Set for Breast Histopathology Images

This data set contains 277,524 images of size 50 × 50, extracted from 162 slide images of breast cancer specimens scanned at 40x. There are 198,738 IDC-negative samples and 78,786 IDC-positive samples.

4.1 Data Link: Breast histopathology data set

4.2 Artificial Intelligence Project Idea: Build a model that can classify breast cancer. You build an image classification model with convolutional neural networks.

4.3 Source Code: Breast Cancer Classification Python Project

5. Cityscapes data set

This is an open source data set for computer vision projects. It contains high-quality pixel-level annotations of video sequences taken in 50 different city streets. The data set is useful for semantic segmentation and for training deep neural networks to understand the urban scene.

5.1 Data Link: Cityscapes data set

5.2 Artificial Intelligence Project Idea: Perform image segmentation and detect different objects in street video. Image segmentation is the process of partitioning a digital image into different categories such as cars, buses, people, trees, roads, etc.

6. Kinetics data set

There are three different Kinetics data sets: Kinetics 400, Kinetics 600, and Kinetics 700.

6.1 Data Link: Kinetics data set

6.2 Artificial Intelligence Project Idea: Build a human action recognition model and detect the action a person is performing. Human action recognition works from a series of observations.

7. MPII Human Pose Data Set

The MPII human pose data set contains 25,000 images of more than 40,000 people with annotated body joints. Overall, the data set covers more than 410 human activities. The data set is 12.9 GB in size.

7.1 Data Link: MPII human pose data set

7.2 Artificial Intelligence Project Idea: Detect different human poses based on the alignment of the body joints. Human pose detection tracks the movement of every part of the body. It is also known as localization of human joints.

8. 20BN-something-something Data Set v2

This is a data set of high-quality video clips that show human actions such as picking something up, putting something down, opening, closing, etc.

It has a total of 220,847 videos.

8.1 Data Link: 20BN-something-something data set

8.2 Artificial Intelligence Project Idea: Build a human action recognition model and detect various human activities. Human activity recognition can be used to detect activities while driving, for surveillance, etc.

9. Objects365 Data Set

Objects365 is a large collection of high-quality images with bounding boxes around objects. It has 365 object categories, 600k images, and 10 million bounding boxes. It is good for building object detection models.

9.1 Data Link: Objects365 data set

9.2 Artificial Intelligence Project Idea: Classify images captured from a camera and detect the objects in each image. Object detection identifies which objects are present in the image and the coordinates of each object.

10. Photo Sketching Data Set

The data set contains images paired with their contour drawings. It has 1000 outdoor images, and each image has 5 rough drawings that represent its outline.

10.1 Data Link: Photo sketching data set

10.2 Artificial Intelligence Project Idea: Build a model that can automatically generate contour drawings from images. It takes an image as input and produces a contour drawing using computer vision techniques.

11. CQ500 Data Set

This data set is publicly available and has 491 head CT scans with 193,317 slices. It includes the opinions of three different radiologists on each scan. The data set can be used to build models that detect bleeding, fractures, and mass effect in the head.

11.1 Data Link: CQ 500 data set

11.2 Artificial Intelligence Project Idea: Build a medical model that can automatically generate a report of fractures, bleeding, or other findings by analyzing the CT scan data set.

12. IMDB-Wiki data set

The IMDB-Wiki data set is one of the largest open source data sets of face images labeled with age and gender. The images are collected from IMDB and Wikipedia. It contains over 500,000 labeled images.

12.1 Data Link: IMDB-Wiki data set

12.2 Artificial Intelligence Project Idea: Build a model that detects faces and predicts their gender and age. You can define categories for different age ranges such as 0-10, 10-20, 30-40, 50-60, etc.

12.3 Source Code: Gender and Age Detection Python Project
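
Predicting an age range instead of an exact age turns the task into classification. A minimal sketch of how ages could be mapped to range labels is shown below; the function name, the 10-year width, and the upper limit of 100 are assumptions chosen for the example.

```python
def age_bucket(age, width=10, max_age=100):
    """Map an age in years to a label such as '20-30' (lower bound inclusive)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Ages at or above max_age fall into the last bucket.
    lower = min(int(age) // width * width, max_age - width)
    return f"{lower}-{lower + width}"

for age in (7, 25, 30, 104):
    print(age, age_bucket(age))
```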

13. Color Detection Data Set

The data set contains a CSV file with 865 color names and their corresponding RGB (red, green, blue) values. It also has the hexadecimal value of each color.

13.1 Data Link: Color detection data set

13.2 Artificial Intelligence Project Idea: The color data set can be used to build a color detection app, where an interface lets us pick a color from an image and the app displays the name of that color.

13.3 Source Code: Color Detection Python Project
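
The lookup at the heart of such an app can be sketched as a nearest-color search: given the RGB value of a picked pixel, return the name of the closest known color. The five rows below are hypothetical stand-ins for the CSV contents, and the Manhattan distance is one simple choice of distance among several.

```python
# A few hypothetical (name, R, G, B) rows; not copied from the data set.
COLORS = [
    ("Red", 255, 0, 0),
    ("Green", 0, 128, 0),
    ("Blue", 0, 0, 255),
    ("White", 255, 255, 255),
    ("Black", 0, 0, 0),
]

def closest_color_name(r, g, b, colors=COLORS):
    """Return the name whose RGB value is nearest to (r, g, b) in Manhattan distance."""
    def distance(row):
        return abs(r - row[1]) + abs(g - row[2]) + abs(b - row[3])
    return min(colors, key=distance)[0]

def to_hex(r, g, b):
    """Format an RGB triple as a hexadecimal color string."""
    return f"#{r:02x}{g:02x}{b:02x}"

print(closest_color_name(250, 10, 5), to_hex(250, 10, 5))
print(closest_color_name(10, 20, 240), to_hex(10, 20, 240))
```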

Machine learning data sets for deep learning


1. Youtube 8M Dataset

The YouTube 8M data set is a large-scale labeled video data set with 6.1 million YouTube video IDs, 350,000 hours of video, 2.6 billion audio/visual features, 3862 classes, and an average of 3 labels per video. It is used for video classification purposes.

1.1 Data Link: YouTube 8M

1.2 Machine Learning Project Idea: Video classification can be done using the data set, and the model can describe what a video is about. A video classifier takes a series of frames as input to determine which category the video belongs to.

2. Urban Sound 8K data set

The Urban Sound data set contains 8732 urban sounds from 10 classes such as air conditioner, dog bark, drilling, siren, street music, etc. The data set is popular for urban sound classification problems.

2.1 Data Link: Urban Sound 8K data set

2.2 Machine Learning Project Idea: We can build a sound classification system to detect the type of urban sound playing in the background. This will help you get started with audio data and understand how to work with unstructured data.

3. LSUN data set

Large-scale Scene Understanding (LSUN) is a data set of millions of color images of scenes and objects. It is much bigger than the ImageNet data set. There are about 59 million images, 10 scene categories, and 20 different object categories.

3.1 Data Link: LSUN data set

3.2 Machine Learning Project Idea: Build a model to detect which scene is in an image, for example classroom, bridge, bedroom, church_outdoor, etc. The goal of scene understanding is to gather as much information as possible about a given scene image. It includes categorization, object detection, and object segmentation.

4. RAVDESS data set

RAVDESS is an acronym for the Ryerson Audio-Visual Database of Emotional Speech and Song. It contains audio files of 24 actors (12 male, 12 female) expressing different emotions such as calm, anger, sadness, happiness, fear, etc. The expressions are recorded at both normal and strong intensity. The data set is useful for speech emotion recognition.

4.1 Data Link: RAVDESS data set

4.2 Machine Learning Project Idea: Build a speech emotion recognition classifier to detect the speaker’s emotions. Audio clips of people are classified into emotions such as anger, happiness, sadness, etc.

4.3 Source Code: Speech Emotion Recognition Python Project

5. Librispeech data set

This data set contains a large amount of English speech derived from the LibriVox project. It has 1000 hours of read English speech in various accents. It is used for speech recognition projects.

5.1 Data Link: Librispeech data set

5.2 Machine Learning Project Idea: Build a speech recognition model that detects what is being said and converts it into text. The purpose of speech recognition is to automatically identify what is being said in the audio.

6. Baidu Apolloscape data set

This data set is designed to promote the development of self-driving technology. It contains high-resolution video with hundreds of thousands of frames and their per-pixel annotations, stereo images, dense point clouds, etc. The data set covers 25 different semantic classes such as cars, pedestrians, bicycles, street lights, etc.

6.1 Data Link: Baidu Apolloscape data set

6.2 Machine Learning Project Idea: Build a self-driving robot that can identify different objects on the road and take appropriate action. The model can segment the objects in an image, which helps prevent collisions and lets the robot find its own path.

Machine learning data sets for finance and economics


1. Quandl Data Portal

Quandl is a vast repository of economic and financial data. Some data sets are free, while others need to be purchased. The large quantity and good quality of the data make this platform a good place to find data sets for production-ready models.

2. World Bank Open Data Portal

The World Bank is a global development organization that provides loans to developing countries. It holds data for all of its programs and makes it publicly available. The data has a lot of missing values, which gives you experience with real-world data.

2.1 Data Link: World Bank open data sets

3. IMF Data Portal

The IMF is the International Monetary Fund, which publishes data on international finances, debt rates, investments, foreign exchange reserves, and commodities.

3.1 Data Link: IMF data sets

4. American Economic Association (AEA) Data Portal

The American Economic Association has a rich collection of data online and is a good source for US macroeconomic data.

4.1 Data Link: AEA data sets

5. Google Trends Data Portal

Google Trends data can be used to explore and analyze data visually. You can also download the data as CSV files with a simple click. We can find out what is trending and what people are searching for.

5.1 Data Link: Google Trends data sets

6. Financial Times Market Data Portal

Financial Times Market Data is a good resource for up-to-date information on financial markets worldwide. You can find stock price indexes, commodities, and foreign exchange data.

6.1 Data Link: Financial Times market data sets

Public Government machine learning data sets


1. Data.gov Portal

This site is home to the US government’s open data. You can find data on various domains such as agriculture, health, climate, education, energy, finance, science and research, etc. Many software applications use the site as a data source for building consumer products.

1.1 Data Link: Data.gov sets

2. Open Government Data (India) Portal

The Open Government Data platform gives us access to data shared by the government of India. It is part of the Digital India initiative and was built with an open source stack. It publishes multiple data sets, tools, APIs, etc.

2.1 Data link: Open government data sets

3. Food Environment Atlas Data Portal

The portal contains data on food in the US and on how local food choices affect people’s diets. It contains research information on food choices and food quality, which helps determine access to healthy food choices.

3.1 Data Link: Food Environment Atlas data sets

4. Health Data Portal

This is a site of the US Department of Health and Human Services. It hosts more than 3000 valuable data sets, and an API is also available.

4.1 Data link: Health data sets

5. Centers for Disease Control and Prevention (CDC) Data Portal

The CDC has a variety of data sets on health topics such as diabetes, cancer, obesity, etc. There are additional resources where you can get data on health issues.

5.1 Data link: CDC statistical data sets

6. London Datastore Portal

This contains data about the lives of people in London, for example how much the population has increased in 5 years or the number of tourists visiting London. It has more than 700 data sets with information about the city of London.

6.1 Data Link: London Datastore data sets

7. Canadian Government Open Data Portal

This is a data site related to Canadians. You can find data sets related to topics such as agriculture, arts, music, education, government, health, etc.

7.1 Data Link: Government of Canada open data sets


In this article, we have looked at more than 70 machine learning data sets that you can use to practice machine learning or data science. Creating your own data set is expensive, so we can use other people’s data sets to get our work done. But read the documentation of each data set carefully: some data sets are free to use, while others require you to credit the owner as they specify.

If you would like to suggest any other machine learning data sets, share them in the comments section. I hope this article has been useful for you.
