Datasets in Keras

Keras is a Python library widely used for training deep learning models. A common problem in deep learning is finding a suitable dataset for developing models. In this article, we will see the list of popular datasets that are already incorporated in the keras.datasets module.

MNIST (Classification of 10 digits):
This dataset is used to classify handwritten digits. It contains 60,000 images in the training set and 10,000 images in the test set. The size of each image is 28×28.

from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Returns:

  • x_train, x_test: An unsigned integer (0-255) array of grayscale image data with shape (num_samples, 28, 28).
  • y_train, y_test: An unsigned integer array of digit labels (integers in range 0-9) with shape (num_samples,).
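The uint8 pixel arrays above are usually rescaled before training. Below is a minimal sketch of typical preprocessing, using dummy arrays with the same dtype and shape conventions as mnist.load_data() so it runs without downloading anything:

```python
import numpy as np

# Dummy stand-ins with the same dtype/shape conventions as mnist.load_data()
x_train = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)
y_train = np.array([3, 1, 4, 1], dtype=np.uint8)

# Typical preprocessing: scale pixels to [0, 1] and flatten for a dense model
x_scaled = x_train.astype("float32") / 255.0
x_flat = x_scaled.reshape(len(x_scaled), -1)   # shape (num_samples, 784)

# One-hot encode the digit labels (10 classes)
y_onehot = np.eye(10, dtype="float32")[y_train]

print(x_flat.shape)    # (4, 784)
print(y_onehot.shape)  # (4, 10)
```

Dividing by 255 maps the unsigned integer pixel values into the [0, 1] range, which generally helps gradient-based training converge.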

Fashion-MNIST (classification of 10 fashion categories):

This dataset can be used as a drop-in replacement for MNIST. It consists of 60,000 28×28 grayscale images of 10 fashion categories, along with a test set of 10,000 images. The class labels are:



Label Description
0 T-shirt/top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot

from keras.datasets import fashion_mnist
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

Returns:

  • x_train, x_test: An unsigned integer (0-255) array of grayscale image data with shape (num_samples, 28, 28).
  • y_train, y_test: An unsigned integer array of fashion category labels (integers in range 0-9) with shape (num_samples,).
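Note that load_data() returns only numeric labels; the human-readable names have to be supplied by hand. A small sketch reproducing the table above (the example label values are made up for illustration):

```python
# fashion_mnist.load_data() returns numeric labels only; this list
# reproduces the label table above in index order.
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

labels = [9, 0, 5]  # example values as they would appear in y_train
print([class_names[i] for i in labels])  # ['Ankle boot', 'T-shirt/top', 'Sandal']
```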

CIFAR10 (classification of 10 image labels):

This dataset contains 10 different categories of images and is widely used in image classification tasks. It consists of 50,000 32×32 colour training images, labelled over 10 categories, and 10,000 test images. The dataset is divided into five training batches, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class. The classes are completely mutually exclusive: there is no overlap between, for example, automobiles and trucks. The class labels are:

Label Description
0 airplane
1 automobile
2 bird
3 cat
4 deer
5 dog
6 frog
7 horse
8 ship
9 truck

from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

Returns:

  • x_train, x_test: An unsigned integer (0-255) array of RGB image data with shape (num_samples, 3, 32, 32) or (num_samples, 32, 32, 3), depending on whether the image_data_format backend setting is channels_first or channels_last respectively. The value “3” in the shape refers to the 3 RGB channels.
  • y_train, y_test: An unsigned integer array of category labels (integers in range 0-9) with shape (num_samples, 1).
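Because the array layout depends on the image_data_format setting, code that expects one layout can convert the other with a transpose. A minimal sketch on a dummy channels_first batch:

```python
import numpy as np

# Dummy channels_first batch: (num_samples, channels, height, width)
x_cf = np.zeros((2, 3, 32, 32), dtype=np.uint8)

# Reorder the axes to channels_last: (num_samples, height, width, channels)
x_cl = np.transpose(x_cf, (0, 2, 3, 1))

print(x_cl.shape)  # (2, 32, 32, 3)
```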

CIFAR100 (classification of 100 image labels):

This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. It consists of 50,000 32×32 colour training images and 10,000 test images: 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).


from keras.datasets import cifar100
(x_train, y_train), (x_test, y_test) = cifar100.load_data(label_mode='fine')

Returns:

  • x_train, x_test: An unsigned integer (0-255) array of RGB image data with shape (num_samples, 3, 32, 32) or (num_samples, 32, 32, 3), depending on whether the image_data_format backend setting is channels_first or channels_last respectively. The value “3” in the shape refers to the 3 RGB channels.
  • y_train, y_test: An unsigned integer array of category labels (integers in range 0-99) with shape (num_samples, 1).

Arguments:



  • label_mode: “fine” or “coarse”.
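The counts described above fit together neatly; a quick arithmetic check:

```python
# Sanity-check the CIFAR-100 layout: 100 fine classes grouped into
# 20 superclasses, with 600 images per class split 500/100 between
# the training and test sets.
fine_classes, coarse_classes = 100, 20
train_per_class, test_per_class = 500, 100

classes_per_superclass = fine_classes // coarse_classes  # 5
total_train = fine_classes * train_per_class             # 50,000
total_test = fine_classes * test_per_class               # 10,000

print(classes_per_superclass, total_train, total_test)  # 5 50000 10000
```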

Boston Housing Prices (Regression):

This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. It contains 13 attributes of houses at different locations around the Boston suburbs in the late 1970s. Targets are the median values of the houses at a location (in k$). The training set contains data for 404 different households, while the test set contains data for 102 different households.


from keras.datasets import boston_housing
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

Returns:

  • x_train, x_test: A numpy array of the 13 attribute values with shape (num_samples, 13).
  • y_train, y_test: A numpy array of target median house prices (in k$) with shape (num_samples,).

Arguments:

  • seed: Random seed for shuffling the data before computing the test split.
  • test_split: fraction of the data to reserve as test set.
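Assuming Keras takes the leading (1 - test_split) fraction of the shuffled 506 samples as the training set, the 404/102 split quoted above falls out directly:

```python
# The Boston dataset has 506 samples in total; the default test_split is 0.2
total_samples = 506
test_split = 0.2

# Training set is the leading (1 - test_split) fraction after shuffling
n_train = int(total_samples * (1 - test_split))  # 404
n_test = total_samples - n_train                 # 102

print(n_train, n_test)  # 404 102
```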

IMDB Movie Reviews (Sentiment Classification):

This dataset is used for binary classification of reviews, i.e., positive or negative. It consists of 25,000 movie reviews from IMDB, labelled by sentiment (positive/negative). These reviews have already been preprocessed, and each review is encoded as a sequence of word indexes (integers). The words are indexed by their overall frequency in the dataset. For example, the integer “5” encodes the 5th most frequent word in the data. This allows for quick filtering operations, such as considering only the top 5,000 words as the model vocabulary.

from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()

Returns:

  • x_train, x_test: list of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words-1. If the maxlen argument was specified, the largest possible sequence length is maxlen.
  • y_train, y_test: list of integer labels (1 for positive or 0 for negative).

Arguments:

  • num_words(int or None): Top most frequent words to consider. Any less frequent word will appear as “oov_char” value in the sequence data.
  • skip_top(int): Top most frequent words to ignore (they will appear as oov_char value in the sequence data).
  • maxlen(int): Maximum sequence length. Any longer sequence will be truncated.
  • seed(int): Seed for reproducible data shuffling.
  • start_char(int): The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character.
  • oov_char(int): words that were cut out because of the num_words or skip_top limit will be replaced with this character.
  • index_from(int): Index actual words with this index and higher.
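The start_char/oov_char/index_from conventions can be illustrated without downloading the dataset. The word_index below is a made-up stand-in for the real mapping returned by imdb.get_word_index(); the sketch shows how the offsets interact:

```python
# Toy illustration of the index conventions: start_char=1 opens each
# sequence, indices 0/1/2 are reserved (padding/start/oov), and real
# word indices begin at index_from=3.  This word_index is hypothetical;
# the real mapping comes from imdb.get_word_index().
word_index = {"the": 1, "movie": 2, "great": 3}  # made-up frequency ranks
index_from = 3

# Encode as load_data() would: actual index = frequency rank + index_from
encoded = [1] + [word_index[w] + index_from for w in ["the", "movie", "great"]]
print(encoded)  # [1, 4, 5, 6]

# Reverse mapping to decode a review back into words
inv = {rank + index_from: word for word, rank in word_index.items()}
inv.update({0: "<pad>", 1: "<start>", 2: "<oov>"})
decoded = [inv[i] for i in encoded]
print(decoded)  # ['<start>', 'the', 'movie', 'great']
```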

Reuters newswire topics classification:

This dataset is used for multiclass text classification. It consists of 11,228 newswires from Reuters, labelled over 46 topics. Just like the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).

from keras.datasets import reuters
(x_train, y_train), (x_test, y_test) = reuters.load_data()

Returns:

  • x_train, x_test: list of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words-1. If the maxlen argument was specified, the largest possible sequence length is maxlen.
  • y_train, y_test: list of integer topic labels (integers in range 0-45).

Arguments:

  • num_words(int or None): Top most frequent words to consider. Any less frequent word will appear as “oov_char” value in the sequence data.
  • skip_top(int): Top most frequent words to ignore (they will appear as oov_char value in the sequence data).
  • maxlen(int): Maximum sequence length. Any longer sequence will be truncated.
  • seed(int): Seed for reproducible data shuffling.
  • start_char(int): The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character.
  • oov_char(int): words that were cut out because of the num_words or skip_top limit will be replaced with this character.
  • index_from(int): Index actual words with this index and higher.
