How to Break a CAPTCHA System with Machine Learning?

Last Updated : 27 Jun, 2023

CAPTCHA, short for Completely Automated Public Turing Test to Tell Computers and Humans Apart, is a technology that helps distinguish humans from bots and protects sites from malicious traffic. But this technology has begun to show its age. CAPTCHA was supposed to be a robust system, but artificial intelligence is rendering it almost useless. To break a CAPTCHA, we need to train a machine learning model. Once trained, the model can be fed a CAPTCHA of the kind it was trained on, and it will attempt to solve it.

In this article, we will explore how one can break a CAPTCHA system with the help of machine learning. We will walk through the complete process and also discuss the limitations of this approach and the ethical issues that need to be considered while attempting it. Keep in mind that our intention behind breaking CAPTCHAs should be to educate ourselves and to highlight where the system fails to filter out non-humans. CAPTCHAs still protect many sites from malicious attacks. Using bots to break CAPTCHAs on websites without permission is unethical and, depending on your location, may also be illegal.

Collection of a Dataset

Collecting data is the first and essential step in training a machine learning model, and it is no different here. First, we need a dataset with many CAPTCHA images. The dataset needs to be diverse so that the model can generalize to the CAPTCHAs it is given. Collecting CAPTCHA images is not easy. Finding a legal way to acquire them is quite an involved process, and scraping them from websites without permission might be illegal and is also unethical. So, we resort to open-source datasets.

One option is a small CAPTCHA dataset available on Kaggle, which is sufficient for learning purposes.

The dataset is effectively a folder of images whose file names serve as the labels. You only need to provide the path to that folder.
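Before training, it can help to check what the labels look like. The following is a minimal sketch, assuming (as the code later in this article does) that each label is the image's file name without the `.png` extension. The file names in `sample_files` and the helper `summarize_labels` are hypothetical and only serve as an illustration.

```python
from collections import Counter
from pathlib import Path

# Hypothetical file names; with a real dataset these would come from
# something like sorted(Path("ML/samples").glob("*.png")).
sample_files = [
    Path("samples/2b827.png"),
    Path("samples/8n5p3.png"),
    Path("samples/yf347.png"),
]


def summarize_labels(paths):
    """Derive labels from file names and collect basic statistics."""
    labels = [p.stem for p in paths]                    # "2b827.png" -> "2b827"
    lengths = Counter(len(label) for label in labels)   # label length -> count
    charset = sorted(set("".join(labels)))              # unique characters
    return labels, lengths, charset


labels, lengths, charset = summarize_labels(sample_files)
print("Labels:", labels)
print("Label lengths:", dict(lengths))
print("Characters present:", charset)
```

If all labels have the same length and the character set is small, the task is easier for the model, and the maximum label length can be used directly when decoding predictions.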

Preprocessing the Images

The second step in training the model is preprocessing, which lets us feed the model only the data it actually needs. Preprocessing transforms each image through steps such as cropping, noise reduction, grayscale conversion, and resizing. These steps make the input images simpler, which helps the model identify patterns in them.
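To make the idea concrete, here is a small sketch of two typical steps, written in plain Python so that it runs without any image library: grayscale conversion with pixel values scaled to the range 0 to 1, and a transpose so that the image width becomes the first axis. The luminance weights (0.299, 0.587, 0.114) are a common convention and are an assumption here, not necessarily what a particular library uses internally. The helper names and the tiny test image are hypothetical.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) pixels in 0..255 to gray floats in 0..1."""
    gray = []
    for row in rgb_image:
        gray.append([(0.299 * r + 0.587 * g + 0.114 * b) / 255.0
                     for r, g, b in row])
    return gray


def transpose(image):
    """Swap rows and columns so that the width becomes the first axis."""
    return [list(column) for column in zip(*image)]


# A tiny 2 x 3 image (height 2, width 3) used only for illustration
tiny = [
    [(255, 255, 255), (0, 0, 0), (255, 0, 0)],
    [(0, 0, 0), (255, 255, 255), (0, 0, 255)],
]

gray = to_grayscale(tiny)
by_width = transpose(gray)
print(len(gray), "x", len(gray[0]))          # height x width
print(len(by_width), "x", len(by_width[0]))  # width x height
```

Putting the width first means that a sequence model can later read the image column by column, from left to right, which matches the order of the characters.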

This step can be done with OpenCV or a library of your choice, in a language such as Python or R. OpenCV is a popular and reliable choice for preprocessing among both professionals and hobbyists. In this article, we use TensorFlow's own image utilities. First, we import the libraries we need:

Python3

import os 
import numpy as np 
import matplotlib.pyplot as plt 
from pathlib import Path 
from collections import Counter 
import tensorflow as tf 
from tensorflow import keras 
from tensorflow.keras import layers 

The code is also available in notebook format. Next, we load the dataset, set the hyperparameters, and define helper functions for splitting and encoding the data.

Python3

# Path to the Dataset 
direc = Path("ML") / "samples"
  
dir_img = sorted(list(map(str, list(direc.glob("*.png"))))) 
img_labels = [img.split(os.path.sep)[-1]. 
              split(".png")[0] for img in dir_img] 
char_img = set(char for label in img_labels for char in label) 
char_img = sorted(list(char_img)) 
  
print("Number of dir_img found: ", len(dir_img)) 
print("Number of img_labels found: ", len(img_labels)) 
print("Number of unique char_img: ", len(char_img)) 
print("Characters present: ", char_img) 
  
# Batch Size of Training and Validation 
batch_size = 16
  
# Setting dimensions of the image 
img_width = 200
img_height = 50
  
# Setting downsampling factor 
downsample_factor = 4
  
# Setting the Maximum Length 
max_length = max([len(label) for label in img_labels]) 
  
# Char to integers 
char_to_num = layers.StringLookup( 
    vocabulary=list(char_img), mask_token=None
) 
  
# Integers to original characters
num_to_char = layers.StringLookup( 
    vocabulary=char_to_num.get_vocabulary(), 
    mask_token=None, invert=True
) 
  
  
def data_split(dir_img, img_labels, 
               train_size=0.9, shuffle=True): 
    # Get the total size of the dataset 
    size = len(dir_img) 
    # Create an indices array and shuffle it if required 
    indices = np.arange(size) 
    if shuffle: 
        np.random.shuffle(indices) 
    # Calculate the size of training samples 
    train_samples = int(size * train_size) 
    # Split data into training and validation sets 
    x_train, y_train = (dir_img[indices[:train_samples]],
                        img_labels[indices[:train_samples]])
    x_valid, y_valid = (dir_img[indices[train_samples:]],
                        img_labels[indices[train_samples:]])
    return x_train, x_valid, y_train, y_valid 
  
  
# Split data into training and validation sets 
x_train, x_valid,\ 
    y_train, y_valid = data_split(np.array(dir_img), 
                                  np.array(img_labels)) 
  
  
def encode_sample(img_path, label): 
    # Read the image 
    img = tf.io.read_file(img_path) 
    # Converting the image to grayscale 
    img = tf.io.decode_png(img, channels=1) 
    img = tf.image.convert_image_dtype(img, tf.float32) 
    # Resizing to the desired size 
    img = tf.image.resize(img, [img_height, img_width]) 
    # Transposing the image 
    img = tf.transpose(img, perm=[1, 0, 2]) 
    # Mapping image label to numbers 
    label = char_to_num(tf.strings.unicode_split(label, 
                                                 input_encoding="UTF-8")) 
  
    return {"image": img, "label": label} 
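The character-to-integer mapping used above can be pictured with a plain-Python sketch. It mimics the behaviour of a lookup with one out-of-vocabulary slot at index 0, which is how `StringLookup` behaves with `mask_token=None` and its default settings. The helper names and the four-character vocabulary are hypothetical and chosen only for illustration.

```python
def build_lookup(vocabulary, oov_token="[UNK]"):
    """Index 0 is reserved for unknown characters; the vocabulary starts at 1."""
    tokens = [oov_token] + list(vocabulary)
    char_to_index = {ch: i for i, ch in enumerate(tokens)}
    return tokens, char_to_index


def encode(label, char_to_index):
    """Map each character to its integer index (0 if unknown)."""
    return [char_to_index.get(ch, 0) for ch in label]


def decode(indices, tokens):
    """Map integer indices back to characters."""
    return "".join(tokens[i] for i in indices)


tokens, char_to_index = build_lookup(["2", "3", "b", "n"])
encoded = encode("b23n", char_to_index)
decoded = decode(encoded, tokens)
print(encoded)  # [3, 1, 2, 4]
print(decoded)  # b23n
```

Because of the extra unknown slot, the vocabulary seen by the model is one entry larger than the number of distinct characters in the labels.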

Training the Machine Learning Model

After preprocessing comes the hard part. Various machine learning algorithms and techniques can be used to break a CAPTCHA, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs are well suited to image recognition, while RNNs process sequential data proficiently, which also makes them suitable for things like audio-based CAPTCHAs. The model in this article combines both: convolutional layers extract features from the image, and recurrent layers read those features as a sequence. The preprocessed images are fed to the model, which gradually recognizes patterns in them by adjusting its weights and biases.

But there is one hurdle. CAPTCHA images are highly variable, which makes finding patterns quite hard for machine learning models. Data augmentation can be used to make the training data more variable, for example by rotating and scaling the images. Flipping is another common augmentation, but it should be used with care on text because it mirrors the characters. Before any augmentation, we need to split the data into two parts, one for training and one for validation, so that we can measure how accurate the model is later. Libraries like TensorFlow let you build CNNs for a wide variety of applications, so it is a valid choice here.
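As an illustration of label-preserving augmentation, the sketch below shifts an image a few pixels to the left or right and pads the gap with a background value. This is a swapped-in, simpler technique than rotation or scaling, chosen because it needs no image library. The function names, the `fill` value of 1.0 (a white background) and the `max_shift` parameter are assumptions for this example, and the training code in this article does not apply augmentation.

```python
import random


def shift_image(image, dx, fill=1.0):
    """Shift a 2D image (list of rows) horizontally by dx pixels.

    Positive dx moves the content to the right, negative dx to the left.
    Vacated pixels are filled with the background value.
    """
    width = len(image[0])
    dx = max(-width, min(width, dx))  # keep the shift within the image
    shifted = []
    for row in image:
        if dx >= 0:
            shifted.append([fill] * dx + row[:width - dx])
        else:
            shifted.append(row[-dx:] + [fill] * (-dx))
    return shifted


def augment(image, max_shift=2, rng=None):
    """Return a randomly shifted copy; the label stays the same."""
    rng = rng or random.Random()
    return shift_image(image, rng.randint(-max_shift, max_shift))


row_image = [[0.0, 0.5, 1.0, 1.0]]
print(shift_image(row_image, 1))   # [[1.0, 0.0, 0.5, 1.0]]
print(shift_image(row_image, -1))  # [[0.5, 1.0, 1.0, 1.0]]
```

Augmentation should be applied only to the training split, so that the validation images still reflect unmodified data.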

Python3

# Creating training dataset 
dataset_train = tf.data.Dataset.from_tensor_slices((x_train, y_train)) 
dataset_train = ( 
    dataset_train.map( 
        encode_sample, num_parallel_calls=tf.data.AUTOTUNE 
    ) 
    .batch(batch_size) 
    .prefetch(buffer_size=tf.data.AUTOTUNE) 
) 
  
  
# Creating validation dataset 
val_data = tf.data.Dataset.from_tensor_slices((x_valid, y_valid)) 
val_data = ( 
    val_data.map( 
        encode_sample, num_parallel_calls=tf.data.AUTOTUNE 
    ) 
    .batch(batch_size) 
    .prefetch(buffer_size=tf.data.AUTOTUNE) 
)

Now let’s plot some images from the training data.

Python3

# Visualizing some training data
_, ax = plt.subplots(4, 4, figsize=(10, 5))
for batch in dataset_train.take(1):
    # Use new names so the dataset-level lists are not overwritten
    batch_images = batch["image"]
    batch_labels = batch["label"]
    for i in range(16):
        img = (batch_images[i] * 255).numpy().astype("uint8")
        label = tf.strings.reduce_join(num_to_char(
            batch_labels[i])).numpy().decode("utf-8")
        ax[i // 4, i % 4].imshow(img[:, :, 0].T, cmap="gray")
        ax[i // 4, i % 4].set_title(label)
        ax[i // 4, i % 4].axis("off")
plt.show()

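Output: a 4 x 4 grid of CAPTCHA images from the first training batch, each titled with its label.

Next, we define a custom layer that computes the CTC (Connectionist Temporal Classification) loss, followed by a function that builds the model.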

Python3

# CTC loss calculation 
class LayerCTC(layers.Layer): 
    def __init__(self, name=None): 
        super().__init__(name=name) 
        self.loss_fn = keras.backend.ctc_batch_cost 
  
    def call(self, y_true, y_pred): 
        # Compute the training-time loss value 
        batch_len = tf.cast(tf.shape(y_true)[0], 
                            dtype="int64") 
        input_length = tf.cast(tf.shape(y_pred)[1], 
                               dtype="int64") 
        label_length = tf.cast(tf.shape(y_true)[1], 
                               dtype="int64") 
  
        input_length = input_length * \ 
            tf.ones(shape=(batch_len, 1), dtype="int64") 
        label_length = label_length * \ 
            tf.ones(shape=(batch_len, 1), dtype="int64") 
  
        loss = self.loss_fn(y_true, y_pred, 
                            input_length, label_length) 
        self.add_loss(loss) 
  
        # Return Computed predictions 
        return y_pred 
  
  
def model_build(): 
    # Define the inputs to the model 
    input_img = layers.Input( 
        shape=(img_width, img_height, 1), 
        name="image", dtype="float32"
    ) 
    img_labels = layers.Input(name="label", 
                              shape=(None,), 
                              dtype="float32") 
  
    # First convolutional block 
    x = layers.Conv2D( 
        32, 
        (3, 3), 
        activation="relu", 
        kernel_initializer="he_normal", 
        padding="same", 
        name="Conv1", 
    )(input_img) 
    x = layers.MaxPooling2D((2, 2), name="pool1")(x) 
  
    # Second convolutional block 
    x = layers.Conv2D( 
        64, 
        (3, 3), 
        activation="relu", 
        kernel_initializer="he_normal", 
        padding="same", 
        name="Conv2", 
    )(x) 
    x = layers.MaxPooling2D((2, 2), name="pool2")(x) 
  
    # Reshaping the output before passing to RNN 
    new_shape = ((img_width // 4), (img_height // 4) * 64) 
    x = layers.Reshape(target_shape=new_shape, name="reshape")(x) 
    x = layers.Dense(64, activation="relu", name="dense1")(x) 
    x = layers.Dropout(0.2)(x) 
  
    # RNNs 
    x = layers.Bidirectional(layers.LSTM( 
        128, return_sequences=True, dropout=0.25))(x) 
    x = layers.Bidirectional(layers.LSTM( 
        64, return_sequences=True, dropout=0.25))(x) 
  
    # Output layer 
    x = layers.Dense( 
        len(char_to_num.get_vocabulary()) + 1, 
        activation="softmax", name="dense2"
    )(x) 
  
    # Calculate CTC loss at each step 
    output = LayerCTC(name="ctc_loss")(img_labels, x) 
  
    # Defining the model 
    model = keras.models.Model( 
        inputs=[input_img, img_labels], 
        outputs=output, 
        name="ocr_model_v1"
    ) 
    opt = keras.optimizers.Adam() 
  
    # Compile the model 
    model.compile(optimizer=opt) 
  
    return model 

Now let's build an instance of the model and print its summary, which lists the layers and the number of parameters in each.

Python3

# Build the model 
model = model_build() 
model.summary()

Output:

Model: "ocr_model_v1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 image (InputLayer)             [(None, 200, 50, 1)  0           []                               
                                ]                                                                 
                                                                                                  
 Conv1 (Conv2D)                 (None, 200, 50, 32)  320         ['image[0][0]']                  
                                                                                                  
 pool1 (MaxPooling2D)           (None, 100, 25, 32)  0           ['Conv1[0][0]']                  
                                                                                                  
 Conv2 (Conv2D)                 (None, 100, 25, 64)  18496       ['pool1[0][0]']                  
                                                                                                  
 pool2 (MaxPooling2D)           (None, 50, 12, 64)   0           ['Conv2[0][0]']                  
                                                                                                  
 reshape (Reshape)              (None, 50, 768)      0           ['pool2[0][0]']                  
                                                                                                  
 dense1 (Dense)                 (None, 50, 64)       49216       ['reshape[0][0]']                
                                                                                                  
 dropout (Dropout)              (None, 50, 64)       0           ['dense1[0][0]']                 
                                                                                                  
 bidirectional (Bidirectional)  (None, 50, 256)      197632      ['dropout[0][0]']                
                                                                                                  
 bidirectional_1 (Bidirectional  (None, 50, 128)     164352      ['bidirectional[0][0]']          
 )                                                                                                
                                                                                                  
 label (InputLayer)             [(None, None)]       0           []                               
                                                                                                  
 dense2 (Dense)                 (None, 50, 21)       2709        ['bidirectional_1[0][0]']        
                                                                                                  
 ctc_loss (LayerCTC)            (None, 50, 21)       0           ['label[0][0]',                  
                                                                  'dense2[0][0]']                 
                                                                                                  
==================================================================================================
Total params: 432,725
Trainable params: 432,725
Non-trainable params: 0
__________________________________________________________________________________________________
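The shapes and parameter counts in the summary can be checked by hand. The sketch below reproduces them with plain arithmetic. The helper functions are hypothetical, and `vocab_size = 20` is an assumption inferred from the 21-unit output layer (19 characters, one unknown token, plus one extra class for the CTC blank).

```python
img_width, img_height = 200, 50
vocab_size = 20  # assumption: 19 characters + "[UNK]", inferred from the summary


def conv2d_params(kernel, in_channels, out_channels):
    """Weights of a square kernel plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """One weight per input-output pair plus one bias per output."""
    return in_units * out_units + out_units


def bilstm_params(in_units, units):
    """Four gates, each with input weights, recurrent weights and a bias,
    doubled for the two directions."""
    return 2 * 4 * ((in_units + units) * units + units)


# Two 2 x 2 pooling layers divide width and height by 4 (rounded down)
time_steps = img_width // 4          # 50 steps fed to the RNN
features = (img_height // 4) * 64    # 12 * 64 = 768 features per step

params = {
    "Conv1": conv2d_params(3, 1, 32),
    "Conv2": conv2d_params(3, 32, 64),
    "dense1": dense_params(features, 64),
    "bidirectional": bilstm_params(64, 128),
    "bidirectional_1": bilstm_params(256, 64),
    "dense2": dense_params(128, vocab_size + 1),
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name}: {count}")
print("Total params:", total)  # 432725
```

The 50 time steps matter for CTC: the model emits one prediction per step, so the number of steps must be at least as large as the longest label.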

Now we are ready to train the model. We will train it for up to 100 epochs and use early stopping so that the model does not overfit the data.

Python3

# Early Stopping Parameters and EPOCH 
epochs = 100
early_stopping_patience = 10
  
  
early_stopping = keras.callbacks.EarlyStopping( 
    monitor="val_loss", 
    patience=early_stopping_patience, 
    restore_best_weights=True
) 
  
# Training the model 
history = model.fit( 
    dataset_train, 
    validation_data=val_data, 
    epochs=epochs, 
    callbacks=[early_stopping], 
) 

Output:

Epoch 80/100
59/59 [==============================] - 2s 36ms/step - loss: 1.7622 - val_loss: 7.1511
Epoch 81/100
59/59 [==============================] - 2s 35ms/step - loss: 1.7216 - val_loss: 7.0523
Epoch 82/100
59/59 [==============================] - 3s 47ms/step - loss: 1.5814 - val_loss: 7.1403
Epoch 83/100
59/59 [==============================] - 2s 37ms/step - loss: 1.6464 - val_loss: 7.0921
Epoch 84/100
59/59 [==============================] - 2s 35ms/step - loss: 1.6113 - val_loss: 7.1740
Epoch 85/100
59/59 [==============================] - 2s 35ms/step - loss: 1.5529 - val_loss: 7.1272
Epoch 86/100
59/59 [==============================] - 2s 39ms/step - loss: 1.5346 - val_loss: 7.0750
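The patience-based stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras implementation: it only counts epochs without improvement and ignores options such as a minimum improvement threshold. The function name and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss).

    Epochs are counted from 1. Training stops once the validation loss
    has failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses for illustration
val_losses = [9.0, 8.0, 7.5, 7.6, 7.7, 7.4, 7.5, 7.6, 7.7]
stopped_at, best_at = early_stopping_epoch(val_losses, patience=3)
print(stopped_at, best_at)  # 9 6
```

With `restore_best_weights=True`, as in the training code above, the model ends up with the weights from the best epoch rather than from the last one.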

Testing the Machine Learning Model

To verify that the model can break the CAPTCHA system, it is very important to test its performance. We can't use the images we trained it on, so we use the images that were held out in the previous step. Various metrics are used to determine how good a model is, including accuracy, precision, recall, and the F1 score.

F1 Score: The harmonic mean of precision and recall. It ranges from 0 to 1.
Accuracy: The ratio of correct predictions to total predictions.
Recall: The fraction of the actual members of a given class that the model correctly identifies.
Precision: The ratio of true positive predictions to all positive predictions made by the model.
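The definitions above can be written out as a short sketch. The first function computes precision, recall and F1 from counts of true positives, false positives and false negatives. The second is an assumption about what is most useful for CAPTCHAs specifically: the share of images solved completely, alongside the share of individual characters that are correct. All names and numbers are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def captcha_accuracy(true_texts, pred_texts):
    """Return (exact-match accuracy, per-character accuracy)."""
    pairs = list(zip(true_texts, pred_texts))
    exact = sum(t == p for t, p in pairs) / len(pairs)
    char_hits = sum(tc == pc for t, p in pairs for tc, pc in zip(t, p))
    char_total = sum(len(t) for t in true_texts)
    return exact, char_hits / char_total


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727

exact, per_char = captcha_accuracy(["2b827", "8n5p3"], ["2b827", "8n5p8"])
print(exact, per_char)  # 0.5 0.9
```

Exact-match accuracy is the stricter measure: a CAPTCHA counts as solved only if every character is right, so it is usually lower than the per-character accuracy.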

Despite our best efforts, a model will never be fully accurate, so we can't rely on it to always be effective. This is partly because CAPTCHA services also use machine learning on their servers to identify and block attempts to break the system. In addition, CAPTCHA providers keep introducing new types of CAPTCHAs that are harder for machine learning models to solve. Now it's time to see how well the model works. We first build a prediction model that takes only the image as input and returns the output of the softmax layer, and then decode that output into text.

Python3

# Get the Model 
prediction_model = keras.models.Model( 
    model.get_layer(name="image").input, 
    model.get_layer(name="dense2").output 
) 
prediction_model.summary() 
  
  
def decode_batch_predictions(pred): 
    input_len = np.ones(pred.shape[0]) * pred.shape[1] 
    results = keras.backend.ctc_decode(pred, 
                                       input_length=input_len, 
                                       greedy=True)[0][0][ 
        :, :max_length 
    ] 
    output_text = [] 
    for res in results: 
        res = tf.strings.reduce_join(num_to_char(res))\ 
        .numpy().decode("utf-8") 
        output_text.append(res) 
    return output_text 
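Greedy CTC decoding, which the function above requests with `greedy=True`, can be illustrated in plain Python: pick the most likely class at each time step, merge consecutive repeats, and drop the blank class. This is a sketch of the idea rather than the library's implementation. It assumes the blank is the last class index, and the probability table is made up.

```python
def ctc_greedy_decode(probs, blank_index):
    """Decode one sequence of per-step class probabilities.

    probs is a list of time steps, each a list of class probabilities.
    """
    # Most likely class at every time step
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]

    decoded = []
    previous = None
    for index in best_path:
        # Keep a class only when it differs from the previous step
        # and is not the blank
        if index != previous and index != blank_index:
            decoded.append(index)
        previous = index
    return decoded


# Four classes: indices 0 to 2 are characters, index 3 is the blank (assumption)
probs = [
    [0.10, 0.70, 0.10, 0.10],  # class 1
    [0.10, 0.60, 0.20, 0.10],  # class 1 again (repeat, merged)
    [0.10, 0.10, 0.10, 0.70],  # blank
    [0.10, 0.80, 0.05, 0.05],  # class 1 (kept, separated by a blank)
    [0.10, 0.10, 0.70, 0.10],  # class 2
    [0.05, 0.05, 0.80, 0.10],  # class 2 again (repeat, merged)
]
print(ctc_greedy_decode(probs, blank_index=3))  # [1, 1, 2]
```

The blank is what allows a label to contain the same character twice in a row: without a blank between them, the two occurrences would be merged into one.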

Now we use the prediction model to read the text in a few CAPTCHA images from the validation set.

Python3

# Check the validation on a few samples 
for batch in val_data.take(1): 
    batch_images = batch["image"] 
    batch_labels = batch["label"] 
  
  
    preds = prediction_model.predict(batch_images) 
    pred_texts = decode_batch_predictions(preds) 
  
  
    orig_texts = [] 
    for label in batch_labels: 
        label = tf.strings.reduce_join(num_to_char(label))\ 
        .numpy().decode("utf-8") 
        orig_texts.append(label) 
  
  
    _, ax = plt.subplots(4, 4, figsize=(15, 5)) 
    for i in range(len(pred_texts)): 
        img = (batch_images[i, :, :, 0] * 255).\ 
        numpy().astype(np.uint8) 
        img = img.T 
        title = f"Prediction: {pred_texts[i]}"
        ax[i // 4, i % 4].imshow(img, cmap="gray") 
        ax[i // 4, i % 4].set_title(title) 
        ax[i // 4, i % 4].axis("off") 
plt.show()

Output: a grid of validation CAPTCHA images, each titled with the model's prediction.

Generating Adversarial Examples

In layman's terms, adversarial examples are inputs created with the sole intention of confusing a neural network. Training on such inputs can make the model more robust, because it is exposed to CAPTCHAs that are hard to solve. Generating adversarial examples involves a few steps:

Choosing the Target Model
Selection of Dataset
Defining the Objective function
Generation of Adversarial Examples

Following these steps, we first choose a target model and select the dataset it was originally trained on. The objective function then measures how far the model's output is from the correct output, and this guides the creation of the adversarial examples. Several techniques can be used to generate them, including the Jacobian-based Saliency Map Attack (JSMA) and the Fast Gradient Sign Method (FGSM). Once the adversarial examples have been created, they can be added to the training data to improve the model. TensorFlow can be used to create these examples as well.
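The Fast Gradient Sign Method can be shown on a very small model. The sketch below uses a logistic regression classifier instead of the CAPTCHA network, because its gradient with respect to the input has a simple closed form. FGSM then moves every input value by a fixed step `epsilon` in the direction that increases the loss. The weights, the input and `epsilon` are made-up values, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under logistic regression."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(value):
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, y_true, epsilon):
    """Perturb x by epsilon in the direction that increases the loss.

    For binary cross-entropy with a sigmoid output, the gradient of the
    loss with respect to input i is (p - y_true) * weights[i].
    """
    p = predict(weights, bias, x)
    gradient = [(p - y_true) * w for w in weights]
    # Clip to [0, 1] so the result is still a valid pixel intensity
    return [min(1.0, max(0.0, xi + epsilon * sign(g)))
            for xi, g in zip(x, gradient)]


weights, bias = [2.0, -3.0, 1.0], 0.0
x, y_true = [0.6, 0.2, 0.5], 1

x_adv = fgsm(weights, bias, x, y_true, epsilon=0.2)
print("original prediction:   ", round(predict(weights, bias, x), 3))
print("adversarial prediction:", round(predict(weights, bias, x_adv), 3))
```

In this example a change of 0.2 per input value is enough to push the prediction from above 0.5 to below it. For a deep network, the gradient would come from automatic differentiation instead of a formula.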

Conclusion

In conclusion, a CAPTCHA system can be broken using machine learning. One needs to train a machine learning model to identify the patterns, and it can then recognize the characters. In this article, we discussed the essential steps for building such a model: collecting a dataset, preprocessing the images, training the model, and testing it on images it had not seen during training. We also looked at ways to measure how accurate the model is. CAPTCHA is evolving as well, and various CAPTCHA systems are already implementing measures to protect themselves against automated attacks. Our intention should be to improve on those measures and create a safe environment for everyone.
