Top 10 Datasets in Machine Learning


Almost anything can be represented as data, which makes data an essential part of both computing and machine learning. How well a machine learning model performs depends heavily on the dataset it is trained on. But how do you determine which dataset is best for your project? Here’s a list of the top 10 free and easily accessible online machine learning datasets.


msleep Dataset

Contains the sleep times and weights of a selection of mammals. It consists of 11 variables and can be used to study mammalian sleep patterns.

  1. name: common name

  2. genus: taxonomic rank

  3. vore: carnivore, omnivore or herbivore

  4. order: taxonomic rank

  5. conservation: status of the mammal 

  6. sleep_total: total amount of sleep measured in hours

  7. sleep_rem: REM sleep measured in hours

  8. sleep_cycle: length of sleep cycle measured in hours

  9. awake: time spent awake measured in hours

  10. brainwt: brain weight in kilograms

  11. bodywt: body weight in kilograms


Car Seat Dataset

Consists of car seat sales at 400 different store locations, described by 11 variables. Sales, Income, Advertising and Population are measured in thousands; the remaining variables use their own units or categories.

  1. Sales: unit sales at each location 

  2. CompPrice: Price charged by competitor at each location

  3. Income: Community income level measured in thousands of dollars

  4. Advertising: Local advertising budget for the company at each location 

  5. Population: Population size in region 

  6. Price: Price the company charges for car seats at each site

  7. ShelveLoc: Bad, Good or Medium, indicating the quality of the shelving location for the car seats at each site

  8. Age: Average age of the local population

  9. Education: Education level at each location

  10. Urban: Yes/No to indicate if the store is in an urban or rural location

  11. US: Yes/No to indicate if the store is in the US or not

Diamond Dataset

Contains information on almost 54,000 diamonds, described by ten variables.

  1. Carat: weight of the diamond

  2. Cut: quality of the cut, rated Fair, Good, Very Good, Premium or Ideal

  3. Color: color of the diamond, graded from D (best) to J (worst)

  4. Clarity: how clear the diamond is, measured on the following scale (worst to best): I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF

  5. Depth: total depth percentage, calculated using the x, y and z variables

  6. Table: width of top of the diamond in relation to its widest point 

  7. Price: amount in USD

  8. X: length in millimeters

  9. Y: width in millimeters

  10. Z: depth in millimeters
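
The Depth and Clarity variables above lend themselves to a short worked example. The sketch below assumes the commonly documented formula for total depth percentage, 2 * z / (x + y) expressed as a percentage, and turns the clarity scale into an ordinal rank. The function names and the sample measurements are hypothetical and only serve as illustration.

```python
# Clarity grades as listed above, ordered worst to best.
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def depth_percentage(x_mm, y_mm, z_mm):
    """Total depth percentage: z divided by the mean of x and y, times 100.

    Assumption: depth = 2 * z / (x + y), expressed as a percentage.
    """
    return 100.0 * 2.0 * z_mm / (x_mm + y_mm)


def clarity_rank(grade):
    """Map a clarity grade to an integer rank (0 = worst, 7 = best)."""
    return CLARITY_ORDER.index(grade)


if __name__ == "__main__":
    # Hypothetical diamond: 4.0 mm long, 4.0 mm wide, 2.5 mm deep.
    print(depth_percentage(4.0, 4.0, 2.5))  # 62.5
    print(clarity_rank("VS1"))              # 4
```

Encoding clarity as a rank preserves its ordering, which a plain one-hot encoding would discard.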


Free Spoken Digit Dataset

This open dataset accepts contributed recordings of spoken digits, as long as they are 8 kHz WAV files in English. Recordings are trimmed so that there is minimal silence at the beginning and end. As an open dataset, it is expected to grow over time as contributions trickle in. It is intended to help with spoken-digit recognition problems and, at the time of this post, consists of 3,000 recordings from six speakers (50 of each digit per speaker).
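
As a rough illustration of the format requirement and the recording count, the sketch below uses Python's standard `wave` module to check that a file reports an 8 kHz sampling rate, and recomputes the total number of recordings from the per-speaker figures. The helper names are hypothetical, and the silent clip is a stand-in generated in memory rather than a real recording from the dataset.

```python
import io
import wave

EXPECTED_RATE_HZ = 8000   # sampling rate required for contributions
DIGITS = 10               # digits 0 through 9
RECORDINGS_PER_DIGIT = 50


def is_8khz_wav(file_obj):
    """Return True if the WAV header reports an 8 kHz sampling rate."""
    with wave.open(file_obj, "rb") as wav:
        return wav.getframerate() == EXPECTED_RATE_HZ


def make_silent_wav(rate_hz, seconds=0.1):
    """Build an in-memory mono 16-bit WAV of silence (stand-in for a recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    buffer.seek(0)
    return buffer


def expected_recordings(speakers):
    """Total recordings if every speaker records each digit 50 times."""
    return speakers * DIGITS * RECORDINGS_PER_DIGIT


if __name__ == "__main__":
    print(is_8khz_wav(make_silent_wav(8000)))   # True
    print(is_8khz_wav(make_silent_wav(16000)))  # False
    print(expected_recordings(6))               # 3000
```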


The Wikipedia Corpus

Wikipedia is not only a resource for students writing research papers, but also a very useful tool for Natural Language Processing researchers. This dataset consists of nearly 1.9 billion words from more than 4 million Wikipedia articles and can be searched by word, phrase, or paragraph.


Face Image Dataset   

The subjects in this dataset are mostly male and female adults between 18 and 20 years old, from various ethnicities. The objective of this dataset is to help distinguish not only between genders but also between emotions. Images of the male and female subjects were taken at a resolution of 180x200 pixels. In total, nearly 400 individuals participated, with 20 images taken per subject. The dataset can be downloaded by anyone as a zip file.


Spam SMS Classifier Dataset 

Ham or spam? This dataset helps predict whether a text message is ham (legitimate) or spam. Consisting of more than 5,500 messages in English, it is beginner-friendly and simple to comprehend. The data is stored in comma-separated value format, one message per line, with two columns: v1, the label (ham or spam), and v2, the raw text.
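
The two-column layout described above can be read with Python's standard `csv` module. The following is a minimal sketch: the three sample rows are invented for illustration and are not taken from the dataset, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter

# Invented sample in the v1 (label), v2 (raw text) layout.
SAMPLE_CSV = """v1,v2
ham,"See you at lunch, ok?"
spam,WIN a FREE prize now
ham,Running late
"""


def load_messages(file_obj):
    """Read (label, text) pairs from a CSV file with v1 and v2 columns."""
    reader = csv.DictReader(file_obj)
    return [(row["v1"], row["v2"]) for row in reader]


def label_counts(messages):
    """Count how many messages carry each label."""
    return Counter(label for label, _ in messages)


if __name__ == "__main__":
    messages = load_messages(io.StringIO(SAMPLE_CSV))
    print(messages[0])             # ('ham', 'See you at lunch, ok?')
    print(label_counts(messages))  # Counter({'ham': 2, 'spam': 1})
```

Checking the label counts first is a cheap way to see how imbalanced the two classes are before training a classifier.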


Fashion MNIST Dataset

Like the Spam SMS Classifier dataset, this dataset is beginner-friendly and useful for learning deep learning pattern-recognition techniques on real-world data. With 70,000 28x28 grayscale images, this set was created as a replacement for the original MNIST dataset as a benchmark for algorithms. Each pixel has an integer value from 0 to 255, with larger numbers representing darker pixels.
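
A common first preprocessing step for images like these is to scale the 0 to 255 pixel values into the range 0.0 to 1.0 and to view each image as a 28x28 grid. The sketch below assumes each image arrives as a flat list of 784 integers; that layout, the function names and the sample image are assumptions made for illustration.

```python
IMAGE_SIDE = 28
PIXELS_PER_IMAGE = IMAGE_SIDE * IMAGE_SIDE  # 784
MAX_PIXEL = 255


def normalize(pixels):
    """Scale raw 0-255 integer pixel values to floats in [0.0, 1.0]."""
    if len(pixels) != PIXELS_PER_IMAGE:
        raise ValueError(f"expected {PIXELS_PER_IMAGE} pixels, got {len(pixels)}")
    return [value / MAX_PIXEL for value in pixels]


def to_rows(flat_pixels):
    """Split a flat list of 784 values into 28 rows of 28 values each."""
    return [
        flat_pixels[start:start + IMAGE_SIDE]
        for start in range(0, PIXELS_PER_IMAGE, IMAGE_SIDE)
    ]


if __name__ == "__main__":
    # Hypothetical image: all zeros except one pixel at the maximum value.
    image = [0] * (PIXELS_PER_IMAGE - 1) + [MAX_PIXEL]
    grid = to_rows(normalize(image))
    print(len(grid), len(grid[0]))  # 28 28
    print(grid[-1][-1])             # 1.0
```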


Breast Cancer Wisconsin (Diagnostic) Dataset 

Often used for classification problems in machine learning, this dataset describes characteristics of the cell nuclei present in each image using the following real-valued features:

  1. Radius 

  2. Texture (standard deviation of gray-scale values)

  3. Perimeter 

  4. Area

  5. Smoothness 

  6. Compactness  (perimeter^2 / area - 1.0) 

  7. Concavity

  8. Concave points

  9. Symmetry

  10. Fractal dimension 
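
The compactness formula in the list above is easy to work through by hand. The sketch below applies it to two simple shapes; the function name and the shapes are illustrative and do not come from the dataset itself.

```python
import math


def compactness(perimeter, area):
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


if __name__ == "__main__":
    # Unit square: perimeter 4, area 1.
    print(compactness(4.0, 1.0))  # 15.0

    # Circle of radius 1: perimeter 2*pi, area pi, so the value is 4*pi - 1.
    print(round(compactness(2 * math.pi, math.pi), 3))  # 11.566
```

A circle encloses the most area for a given perimeter, so 4π - 1 is the smallest value this measure can take; more irregular outlines score higher.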


Iris Flower Dataset 

Used by the renowned statistician R.A. Fisher in 1936, this dataset is still a beginner-friendly basis for simple machine learning projects. The dataset is small and consists of four attributes, all measured in centimeters (sepal length, sepal width, petal length and petal width), and three classes: Virginica, Setosa and Versicolor.


Creating datasets for machine learning is laborious manual work, but luckily there are many public datasets available. The datasets mentioned above are user-friendly, and there are plenty of other accessible datasets to choose from, whatever your project or use case.







© 2020 by AphidByte LLC. All Rights Reserved.
