How to Use MNIST Dataset

The MNIST dataset is a popular benchmark for training and testing machine learning models on handwritten digit recognition. This article explores the MNIST dataset, its characteristics, and its significance in machine learning.

Table of Contents

What is MNIST Dataset?
Characteristics of MNIST dataset
Origin of the MNIST Dataset
Methods to load MNIST dataset in Python
1. Loading MNIST dataset using TensorFlow/Keras
2. Loading MNIST dataset Using PyTorch
Significance of MNIST in Machine Learning
Applications of MNIST

What is MNIST Dataset?

MNIST stands for "Modified National Institute of Standards and Technology". The dataset is a large collection of handwritten digits that is commonly used for training image processing systems. It was created by re-mixing samples from NIST's original datasets, which were collected from American Census Bureau employees and high school students. It is designed to help researchers develop and test machine learning algorithms for pattern recognition. It contains 60,000 training images and 10,000 testing images, each a grayscale image of 28x28 pixels.

Characteristics of MNIST dataset

The MNIST dataset is a collection of 70,000 handwritten digits (0-9), with each image being 28x28 pixels. Its key properties are:

Number of Instances: 70,000 images
Number of Attributes: 784 (28x28 pixels)
Target: The label column holds the digit (0-9) shown in the handwritten image.
Pixel 1-784: Each pixel value (0-255) represents the grayscale intensity of the corresponding pixel in the image.
The dataset is divided into two main subsets:
1. Training Set: Consists of 60,000 images along with their labels, commonly used for training machine learning models.
2. Test Set: Contains 10,000 images with their corresponding labels, used for evaluating the performance of trained models.
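The attribute layout above (784 pixel values per image, each in the range 0-255, plus a digit label) can be illustrated with a small NumPy sketch. The helper names `flatten_and_scale` and `one_hot` are hypothetical, and the arrays are random stand-ins with MNIST's shape rather than real digits, so treat this as one possible way to prepare the data, not a required step.

```python
import numpy as np


def flatten_and_scale(images: np.ndarray) -> np.ndarray:
    """Turn (N, 28, 28) uint8 images into (N, 784) floats in [0, 1]."""
    n = images.shape[0]
    flat = images.reshape(n, -1)            # 28 * 28 = 784 attributes per image
    return flat.astype(np.float32) / 255.0  # 0-255 intensities -> 0.0-1.0


def one_hot(labels: np.ndarray, num_classes: int = 10) -> np.ndarray:
    """Encode digit labels 0-9 as rows containing a single 1.0."""
    return np.eye(num_classes, dtype=np.float32)[labels]


# Synthetic stand-in with MNIST's shape and value range (not real digits)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 0, 9, 1, 7])

features = flatten_and_scale(fake_images)
targets = one_hot(fake_labels)
print(features.shape, targets.shape)  # (5, 784) (5, 10)
```

Scaling to the 0-1 range and one-hot encoding are common conventions for neural network inputs; models such as decision trees can work with the raw 0-255 values directly.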

Origin of the MNIST Dataset

The MNIST dataset, now a standard input for many tasks in image processing and machine learning, traces back to the National Institute of Standards and Technology (NIST). NIST, a US government agency focused on measurement science and standards, curates various datasets, including two particularly relevant to handwritten digits:

Special Database 3 (SD-3): Collected from US Census Bureau employees. Census staff write numeric values routinely as part of their work, so their samples are comparatively clean and easy to recognize.
Special Database 1 (SD-1): Collected from high school students. These digits are messier and harder to recognize than the Census Bureau samples, but they cover a wider variety of writing styles.

These datasets could not be used directly; they had to be reorganized into suitable training and test sets. NIST's original split between the two collections created a potential bias:

SD-3 was designated as the training set. Because its writers were practiced at writing numbers by hand, a model trained only on it could become biased toward such "clean" digits.
SD-1 was designated as the test set. Without exposure to its wider range of writing styles during training, a model trained only on SD-3 could perform poorly on SD-1.

To address this bias and obtain a more balanced dataset, the MNIST developers mixed samples from the two NIST special databases: the training set combines 30,000 images from SD-3 with 30,000 from SD-1, and the test set combines 5,000 images from each. As a result, both training and testing cover a wider range of writing styles, which supports models that generalize better.
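The idea of drawing training and test data equally from two collections can be sketched as follows. This is only an illustration: `mix_sources` is a hypothetical helper, the two arrays of sample IDs are stand-ins for the NIST collections, and the counts are scaled down. It is not the procedure the MNIST authors actually ran.

```python
import numpy as np


def mix_sources(source_a, source_b, n_train_each, n_test_each, seed=0):
    """Build train/test sets that each draw equally from two sources."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for source in (source_a, source_b):
        order = rng.permutation(len(source))
        # Disjoint slices so no sample lands in both train and test
        train_idx = order[:n_train_each]
        test_idx = order[n_train_each:n_train_each + n_test_each]
        train_parts.append(source[train_idx])
        test_parts.append(source[test_idx])
    train = np.concatenate(train_parts)
    test = np.concatenate(test_parts)
    rng.shuffle(train)  # interleave the two sources
    rng.shuffle(test)
    return train, test


# Hypothetical sample IDs: 0-99 for one collection, 100-199 for the other
collection_a = np.arange(0, 100)
collection_b = np.arange(100, 200)

train, test = mix_sources(collection_a, collection_b, n_train_each=30, n_test_each=5)
print(len(train), len(test))  # 60 10
```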

Methods to load MNIST dataset in Python

Loading the MNIST dataset in Python can be done in several ways, depending on the libraries and tools you prefer to use. Below are some of the most common methods to load the MNIST dataset using different Python libraries:

Loading the MNIST dataset using TensorFlow/Keras
Loading MNIST dataset using PyTorch

Loading MNIST dataset using TensorFlow/Keras

This code snippet loads the MNIST dataset using Keras, retrieves the training images and labels, and then plots four images in a row with their corresponding labels. Each image is displayed in grayscale.

Python

from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt

# Load the MNIST dataset (only the training split is needed here)
(X_train, y_train), (_, _) = mnist.load_data()

# Plot the first 4 training images in a row
plt.figure(figsize=(10, 5))
for i in range(4):
    plt.subplot(1, 4, i+1)
    plt.imshow(X_train[i], cmap='gray')
    plt.title(f"Label: {y_train[i]}")
    plt.axis('off')
plt.tight_layout()
plt.show()

Output:

A figure showing the first four training images in a row, each in grayscale with its label as the title.

Loading MNIST dataset Using PyTorch

PyTorch offers a similar utility through torchvision.datasets, which is very convenient, especially when combined with torchvision.transforms to perform basic preprocessing like converting images to tensor format.

Python

import matplotlib.pyplot as plt
import torch
from torchvision import datasets, transforms

# Define the transformation to convert images to PyTorch tensors
transform = transforms.Compose([transforms.ToTensor()])

# Load the MNIST dataset with the specified transformation
mnist_pytorch = datasets.MNIST(root='./data', train=True, download=True, transform=transform)

# Create a DataLoader to load the dataset in batches
train_loader_pytorch = torch.utils.data.DataLoader(mnist_pytorch, batch_size=1, shuffle=False)

# Create a figure to display the images
plt.figure(figsize=(15, 3))

# Plot the first few images in a row
for i, (image, label) in enumerate(train_loader_pytorch):
    if i < 5:  # Plot the first 5 samples
        plt.subplot(1, 5, i + 1)
        # image has shape (1, 1, 28, 28); squeeze it down to (28, 28)
        plt.imshow(image[0].squeeze(), cmap='gray')
        plt.title(f"Label: {label.item()}")
        plt.axis('off')
    else:
        break  # Exit the loop after plotting 5 samples

plt.tight_layout()
plt.show()

Output:

A figure showing the first five training images in a row, each in grayscale with its label as the title.

Significance of MNIST in Machine Learning

MNIST is widely used as a starter dataset in machine learning for several reasons:

Benchmarking: It provides a straightforward dataset to test and benchmark machine learning models, particularly in image recognition algorithms.
Learning Tool: Due to its simplicity and small size, MNIST is an excellent dataset for beginners to learn the basics of machine learning and pattern recognition.
Research: It continues to be a reference data set for evaluating new machine learning techniques.
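To make the benchmarking point concrete, here is a minimal sketch of scoring a very simple baseline, a nearest-centroid classifier, on a held-out set. The function names are hypothetical, and the data is synthetic (ten random 784-value "prototypes" plus noise) so that the example runs without downloading anything. Accuracy on real MNIST images would be lower than on this easy stand-in.

```python
import numpy as np


def fit_centroids(features, labels, num_classes=10):
    """Average the feature vectors of each class into one centroid."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])


def predict(centroids, features):
    """Assign each sample to the class of its nearest centroid."""
    # Squared Euclidean distance to each centroid, one column per class
    dists = np.stack([((features - c) ** 2).sum(axis=1) for c in centroids], axis=1)
    return dists.argmin(axis=1)


def accuracy(predicted, actual):
    """Fraction of samples whose predicted class matches the true class."""
    return float((predicted == actual).mean())


# Synthetic stand-in data: one random prototype per digit class, plus noise
rng = np.random.default_rng(0)
prototypes = rng.random((10, 784))
train_y = np.repeat(np.arange(10), 20)   # 20 training samples per class
test_y = np.repeat(np.arange(10), 5)     # 5 test samples per class
train_x = prototypes[train_y] + 0.05 * rng.standard_normal((200, 784))
test_x = prototypes[test_y] + 0.05 * rng.standard_normal((50, 784))

centroids = fit_centroids(train_x, train_y)
score = accuracy(predict(centroids, test_x), test_y)
print(f"test accuracy: {score:.2f}")
```

The same three steps (fit on the training set, predict on the test set, report accuracy) apply to any model being compared on MNIST, which is what makes the fixed 60,000/10,000 split useful as a shared yardstick.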

Applications of MNIST

While MNIST is primarily used for education and for benchmarking algorithms in academic studies, the techniques learned on it also carry over to practical digit recognition tasks in the banking sector, postal services, and document management:

Banking Sector
- Recognizing Handwritten Numbers on Checks: Banks process large volumes of handwritten checks. Digit recognition models of the kind trained on MNIST can read the amount and other numbers written on a check, which reduces manual data entry, lowers error rates, and speeds up check handling.
Postal Services
- Automating Postal Code Reading: Accurate parcel sorting and timely delivery depend on reading postal codes correctly. Models of the kind trained on MNIST can recognize zip codes on envelopes despite variation in handwriting and print quality, which speeds up sorting and reduces delivery delays.
Document Management
- Digitizing Written Documents and Recognizing Numbers: Many documents, such as invoices, receipts, and forms, contain handwritten numbers. Digit recognition techniques developed on MNIST can help extract and recognize those figures during scanning and digitization. Automating this data entry streamlines processing, simplifies data mining, and makes documents easier to search.

Conclusion

The MNIST dataset is one of the early datasets that shaped the development of machine learning and image processing. Its ease of use, openness, and accuracy make it a good starting point for beginners learning image classification and artificial neural networks. MNIST also serves researchers as a common benchmark: because many methods are evaluated on it, algorithms can be compared against one another to see which perform better on the task. It is frequently used to train digit recognition algorithms, and the techniques developed on it carry over to more complex image processing tasks. As machine learning continues to develop, MNIST is likely to remain a reference point for education, research, and development.

Frequently Asked Questions about the MNIST Dataset

1. What is the MNIST dataset?

It is a collection of handwritten digits widely used for training and testing machine learning models. It contains 70,000 images of handwritten digits from 0 to 9.

2. How can I download the MNIST dataset?

The MNIST dataset can be downloaded from several sources. A common method is to use Python libraries that facilitate machine learning. For example, with TensorFlow or PyTorch, you can download MNIST directly through their dataset utilities.

3. How do I load the MNIST dataset using TensorFlow?

In TensorFlow, you can easily load the MNIST dataset with the following code:
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

4. What is the size of the MNIST dataset?

The MNIST dataset contains a total of 70,000 images divided into a training set of 60,000 images and a test set of 10,000 images. Each image is 28x28 pixels, grayscale.
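As a rough worked example, the raw pixel storage implied by these numbers can be computed directly. The sketch assumes one unsigned byte per pixel and counts only uncompressed pixel data, ignoring labels and file headers. The downloadable files are compressed, so they are smaller on disk.

```python
TRAIN_IMAGES = 60_000
TEST_IMAGES = 10_000
PIXELS_PER_IMAGE = 28 * 28   # 784 pixels per image
BYTES_PER_PIXEL = 1          # assumption: one unsigned byte per 0-255 intensity

total_images = TRAIN_IMAGES + TEST_IMAGES
raw_bytes = total_images * PIXELS_PER_IMAGE * BYTES_PER_PIXEL

print(total_images)                    # 70000
print(raw_bytes)                       # 54880000
print(f"{raw_bytes / 2**20:.1f} MiB")  # 52.3 MiB
```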

5. How can I use the MNIST dataset with PyTorch?

To use the MNIST dataset in PyTorch, you can use the torchvision package, which includes utilities for loading datasets. Here's how you can load MNIST:
import torchvision.datasets as datasets
mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=None)
mnist_testset = datasets.MNIST(root='./data', train=False, download=True, transform=None)
