
Datasets in Keras

Last Updated : 17 Jul, 2020

Keras is a Python library that is widely used for training deep learning models. One of the common problems in deep learning is finding a suitable dataset for developing models. In this article, we will see the list of popular datasets that are already incorporated in the keras.datasets module.

MNIST (Classification of 10 digits):
This dataset is used to classify handwritten digits. It contains 60,000 images in the training set and 10,000 images in the test set. The size of each image is 28×28.




from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()


Returns:

  • x_train, x_test: An unsigned integer (uint8, values 0-255) array of grayscale image data with shape (num_samples, 28, 28).
  • y_train, y_test: An unsigned integer (uint8) array of digit labels (integers in range 0-9) with shape (num_samples,).
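
A quick way to verify the documented shapes and dtypes after loading (a minimal sketch; the final scaling step is a common preprocessing choice, not something load_data performs):

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# confirm the documented shapes and dtype
print(x_train.shape, y_train.shape)   # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)     # (10000, 28, 28) (10000,)
print(x_train.dtype)                  # uint8

# typical preprocessing: scale pixel values from 0-255 down to 0-1
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0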

Fashion-MNIST (classification of 10 fashion categories):

This dataset can be used as a drop-in replacement for MNIST. It consists of 60,000 28×28 grayscale images of 10 fashion categories, along with a test set of 10,000 images. The class labels are:

Label   Description
0       T-shirt/top
1       Trouser
2       Pullover
3       Dress
4       Coat
5       Sandal
6       Shirt
7       Sneaker
8       Bag
9       Ankle boot




from keras.datasets import fashion_mnist
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()


Returns:

  • x_train, x_test: An unsigned integer (uint8, values 0-255) array of grayscale image data with shape (num_samples, 28, 28).
  • y_train, y_test: An unsigned integer (uint8) array of clothing category labels (integers in range 0-9) with shape (num_samples,).
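
The integer labels can be mapped back to readable names using a hand-written list that mirrors the table above (class_names below is defined purely for illustration; it is not returned by the loader):

from keras.datasets import fashion_mnist

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# class_names mirrors the label table above; it is not part of the Keras API
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

print(y_train[0], class_names[y_train[0]])   # prints the first label and its name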

CIFAR10 (classification of 10 image labels):

This dataset contains 10 different categories of images and is widely used in image classification tasks. It consists of 50,000 32×32 colour training images, labelled over 10 categories, and 10,000 test images. The dataset is divided into five training batches, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another; between them, the training batches contain exactly 5,000 images from each class. The classes are completely mutually exclusive; for example, there is no overlap between automobiles and trucks. The class labels are:

Label   Description
0       airplane
1       automobile
2       bird
3       cat
4       deer
5       dog
6       frog
7       horse
8       ship
9       truck




from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()


Returns:

  • x_train, x_test: An unsigned integer (uint8, values 0-255) array of RGB image data with shape (num_samples, 3, 32, 32) or (num_samples, 32, 32, 3), depending on whether the image_data_format backend setting is channels_first or channels_last respectively. The value “3” in the shape refers to the 3 RGB channels.
  • y_train, y_test: An unsigned integer (uint8) array of category labels (integers in range 0-9) with shape (num_samples, 1).
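
Because the channel ordering depends on the backend image_data_format setting, it is worth confirming which layout was returned; the sketch below also one-hot encodes the (num_samples, 1) label array with the standard to_categorical utility (the encoding is ordinary preprocessing, not part of load_data):

from keras import backend as K
from keras.datasets import cifar10
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# channels_last gives (num_samples, 32, 32, 3); channels_first gives (num_samples, 3, 32, 32)
print(K.image_data_format(), x_train.shape)

# labels arrive with shape (num_samples, 1); one-hot encode them for training
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)
print(y_train.shape)   # (50000, 10)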

CIFAR100 (classification of 100 image labels):

This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each: 500 training images and 100 testing images per class, for a total of 50,000 32×32 colour training images and 10,000 test images, widely used in image classification tasks. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).




from keras.datasets import cifar100
(x_train, y_train), (x_test, y_test) = cifar100.load_data(label_mode='fine')


Returns:

  • x_train, x_test: An unsigned integer (uint8, values 0-255) array of RGB image data with shape (num_samples, 3, 32, 32) or (num_samples, 32, 32, 3), depending on whether the image_data_format backend setting is channels_first or channels_last respectively. The value “3” in the shape refers to the 3 RGB channels.
  • y_train, y_test: An unsigned integer (uint8) array of category labels with shape (num_samples, 1); integers in range 0-99 for label_mode='fine', or 0-19 for label_mode='coarse'.

Arguments:

  • label_mode: “fine” (return the 100 class labels) or “coarse” (return the 20 superclass labels).
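
A short sketch contrasting the two label modes (numpy is used here only to count distinct label values; it is illustration, not part of the loader):

import numpy as np
from keras.datasets import cifar100

# 'fine' returns the 100 class labels, 'coarse' the 20 superclass labels
(_, y_fine), _ = cifar100.load_data(label_mode='fine')
(_, y_coarse), _ = cifar100.load_data(label_mode='coarse')

print(len(np.unique(y_fine)))     # 100
print(len(np.unique(y_coarse)))   # 20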

Boston Housing Prices (Regression):

This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. It contains 13 attributes of houses at different locations around the Boston suburbs in the late 1970s. Targets are the median values of the houses at a location (in k$). The training set contains 404 samples while the test set contains 102 samples.




from keras.datasets import boston_housing
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()


Returns:

  • x_train, x_test: A numpy array of the 13 house attribute values with shape (num_samples, 13).
  • y_train, y_test: A numpy array of target median house prices in k$ with shape (num_samples,).

Arguments:

  • seed: Random seed for shuffling the data before computing the test split.
  • test_split: Fraction of the data to reserve as the test set.
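
Both arguments are passed as keyword arguments to load_data; the sketch below uses them together with simple feature standardization (the standardization is a common preprocessing step chosen for illustration, not something the loader performs):

from keras.datasets import boston_housing

# reserve 20% of the data for testing, with a fixed shuffle seed for reproducibility
(x_train, y_train), (x_test, y_test) = boston_housing.load_data(test_split=0.2, seed=113)

# standardize each of the 13 attributes using statistics from the training set only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std

print(x_train.shape, y_train.shape)   # (404, 13) (404,)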

IMDB Movie Reviews (Sentiment Classification):

This dataset is used for binary classification of reviews, i.e., positive or negative. It consists of 25,000 movie reviews from IMDB, labelled by sentiment (positive/negative). The reviews have already been preprocessed, and each review is encoded as a sequence of word indexes (integers). The words are indexed by their overall frequency in the dataset; for example, the integer “5” encodes the 5th most frequent word in the data. This allows for quick filtering operations, such as considering only the top 5,000 most frequent words as the model vocabulary.




from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()


Returns:

  • x_train, x_test: list of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words-1. If the maxlen argument was specified, the largest possible sequence length is maxlen.
  • y_train, y_test: list of integer labels (1 for positive or 0 for negative).

Arguments:

  • num_words(int or None): Top most frequent words to consider. Any less frequent word will appear as “oov_char” value in the sequence data.
  • skip_top(int): Top most frequent words to ignore (they will appear as oov_char value in the sequence data).
  • maxlen(int): Maximum sequence length. Any longer sequence will be truncated.
  • seed(int): Seed for reproducible data shuffling.
  • start_char(int): The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character.
  • oov_char(int): Words that were cut out because of the num_words or skip_top limit will be replaced with this character.
  • index_from(int): Index actual words with this index and higher.
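
Since reviews are stored as frequency-ranked integers, they can be decoded back to text with imdb.get_word_index(); note that indices in the loaded sequences are shifted by index_from (3 by default), with 0, 1 and 2 reserved for the padding, start and out-of-vocabulary markers. A minimal decoding sketch:

from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# map frequency ranks back to words, shifting by the default index_from of 3
word_index = imdb.get_word_index()
reverse_index = {rank + 3: word for word, rank in word_index.items()}

# indices 0, 1 and 2 are reserved for the padding, start and oov markers
decoded_review = " ".join(reverse_index.get(i, "?") for i in x_train[0])
print(y_train[0], decoded_review[:100])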

Reuters newswire topics classification:

This dataset is used for multiclass text classification. It consists of 11,228 newswires from Reuters, labelled over 46 topics. Just like the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).




from keras.datasets import reuters
(x_train, y_train), (x_test, y_test) = reuters.load_data()


Returns:

  • x_train, x_test: list of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words-1. If the maxlen argument was specified, the largest possible sequence length is maxlen.
  • y_train, y_test: list of integer topic labels (integers in range 0-45).

Arguments:

  • num_words(int or None): Top most frequent words to consider. Any less frequent word will appear as “oov_char” value in the sequence data.
  • skip_top(int): Top most frequent words to ignore (they will appear as oov_char value in the sequence data).
  • maxlen(int): Maximum sequence length. Any longer sequence will be truncated.
  • seed(int): Seed for reproducible data shuffling.
  • start_char(int): The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character.
  • oov_char(int): Words that were cut out because of the num_words or skip_top limit will be replaced with this character.
  • index_from(int): Index actual words with this index and higher.
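
The newswires come back as variable-length integer sequences, so they are usually padded to a fixed length before being fed to a model; the sketch below uses the standard pad_sequences utility (the maxlen of 200 is an arbitrary choice for illustration):

import numpy as np
from keras.datasets import reuters
from keras.preprocessing.sequence import pad_sequences

(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=10000)

# the labels cover 46 distinct topics
print(len(np.unique(y_train)))   # 46

# pad/truncate every newswire to 200 word indices (0 is the padding value)
x_train = pad_sequences(x_train, maxlen=200)
x_test = pad_sequences(x_test, maxlen=200)
print(x_train.shape)   # (8982, 200)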

