Machine Translation with Transformer in Python

Machine translation converts text from one language to another. Popular online services such as Google Translate and Microsoft Translator use machine translation to provide quick, accessible translations between a wide range of languages. Transformer models are now the most widely adopted approach to machine translation. They follow the sequence-to-sequence (Seq2Seq) design and use context to learn the mappings between source and target languages. In this article, we will take a pretrained model from Hugging Face and fine-tune it to translate text from English to Hindi.

What is a Transformer?

The transformer architecture can process all the parts of input in parallel through its self-attention mechanism without the need to sequentially process them. The transformer architecture has two parts: an encoder and a decoder. If we want to build an application to convert a sentence from one language to another (English to Hindi), we need to use both the encoder and decoder blocks. This was the original problem (known as a sequence-to-sequence translation) for which the transformer architecture was developed.

However, depending on the type of task, we can use only the encoder block or only the decoder block of the transformer architecture. The core of both blocks is multi-head attention. The main difference is that self-attention in the decoder is masked, so each position can attend only to earlier positions in the output; the decoder also attends to the encoder's output. These attention layers tell the model to pay specific attention to certain elements in the input sequence and ignore others when computing the feature representations.
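
As a rough illustration of the masking described above, here is a minimal single-head scaled dot-product attention in NumPy. It is a simplified sketch, not the model's actual implementation: there are no learned projections and no multiple heads, and the function name and toy shapes are assumptions made for this example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, causal=False):
    """Single-head attention over arrays of shape (seq_len, d)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # similarity of every query with every key
    if causal:
        # Decoder-style mask: position i may not attend to positions j > i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (subtract the row maximum for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 toy token vectors of size 8

_, enc_weights = scaled_dot_product_attention(x, x, x)               # encoder-style
_, dec_weights = scaled_dot_product_attention(x, x, x, causal=True)  # decoder-style
print(np.round(dec_weights, 2))
```

In the masked case the attention weights above the diagonal are zero, so the first token attends only to itself, the second to the first two tokens, and so on.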

Helsinki-NLP is the name under which the University of Helsinki's language technology research group publishes its translation models. The University of Helsinki has been actively involved in various NLP projects and research endeavours. One notable project is the Helsinki-NLP GitHub repository, where the university's NLP researchers and developers contribute to open-source projects related to natural language processing. This repository includes implementations, models, and tools for a variety of NLP tasks. We will use the group's English-to-Hindi model, Helsinki-NLP/opus-mt-en-hi, and fine-tune it on our dataset.

Machine Translation using Transformers

1. Libraries installation

Install the following libraries if they are not already available in your environment. They are required to run the subsequent code.

  1. torch is an open-source machine learning framework that provides a flexible and efficient platform for building and training deep neural networks.
  2. datasets is required for loading the data on which we will fine-tune our model.
  3. transformers is required to load the pretrained model from Hugging Face.
  4. transformers[torch] installs the additional libraries needed for fine-tuning, such as accelerate.
  5. evaluate and sacrebleu are used to evaluate our model.
  6. sentencepiece is used by the tokenizer.
!pip install datasets
!pip install transformers
!pip install sentencepiece
!pip install transformers[torch]
!pip install sacrebleu
!pip install evaluate
!pip install accelerate -U
!pip install gradio
!pip install kaleido cohere openai tiktoken typing-extensions==4.5.0

2. Dataset loading

Let us load the dataset using the datasets library.

Dataset Used

We will use the cfilt/iitb-english-hindi dataset available on Hugging Face.

The IIT Bombay English-Hindi corpus comprises parallel texts for English-Hindi and monolingual Hindi texts sourced from various existing platforms and corpora established at the Center for Indian Language Technology, IIT Bombay, over time. It is a resource for training and evaluating English-Hindi machine translation models. Researchers and developers can use the datasets to improve the accuracy and performance of machine translation systems for these languages.

To get more specific details about the “cfilt/iitb-english-hindi” dataset, including its size, source, and any specific characteristics, check the official documentation or publications from CFILT or IITB.

from datasets import load_dataset
dataset = load_dataset("cfilt/iitb-english-hindi")

3. Model and Tokenizer loading

max_length = 256
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-hi")

Let us see the model's output on one example from the validation set. The input sentence is: ‘Rajesh Gavre, the President of the MNPA teachers association, honoured the school by presenting the award’.

# Translate one validation example with the pretrained (not yet fine-tuned) model
article = dataset['validation'][2]['translation']['en']
inputs = tokenizer(article, return_tensors="pt")
translated_tokens = model.generate(**inputs, max_length=256)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]


'एमएनएपी शिक्षकों के राष्ट्रपति, राजस्वीवर ने इस पुरस्कार को पेश करके स्कूल की प्रतिष्ठा की'

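Let’s check the expected output, that is, the reference translation of the same example, using the following code.

dataset['validation'][2]['translation']['hi']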

'मनपा शिक्षक संघ के अध्यक्ष राजेश गवरे ने स्कूल को भेंट देकर सराहना की।'

Let us fine-tune the model.

4. Tokenize the dataset

  1. The preprocess_function prepares a batch of examples from the translation dataset for training.
  2. Input Extraction: We extract the English sentences from the “en” field of each example in the “translation” field of the input examples dictionary. Similarly, we extract the Hindi sentences from the “hi” field of each example in the “translation” field.
  3. Tokenization: We tokenize the inputs and targets using the tokenizer defined above. The max_length parameter specifies the maximum length of the tokenized sequences, and truncation=True indicates that sequences exceeding this length should be truncated.
  4. Label Preparation: We assign the input IDs of the tokenized Hindi sentences to the “labels” key in the model_inputs dictionary. This step is crucial for training sequence-to-sequence models, as it provides the model with the correct target sequences during training.
  5. The function returns the preprocessed inputs in a format suitable for training a sequence-to-sequence model. The model_inputs dictionary contains the tokenized English sentences and, under the “labels” key, the token IDs of the corresponding Hindi sentences.

def preprocess_function(examples):
  # Each entry of examples["translation"] is a dict with "en" and "hi" keys
  inputs = [ex["en"] for ex in examples["translation"]]
  targets = [ex["hi"] for ex in examples["translation"]]
  model_inputs = tokenizer(inputs, max_length=max_length, truncation=True)
  # text_target makes the tokenizer encode the Hindi side as target-language text
  labels = tokenizer(text_target=targets, max_length=max_length, truncation=True)
  model_inputs["labels"] = labels["input_ids"]
  return model_inputs
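
To make the shape of the data concrete, the following toy version runs the same steps without downloading a model. The whitespace tokenizer, its vocabulary handling, and the names toy_tokenize and toy_preprocess are all hypothetical stand-ins for illustration; the real tokenizer produces subword IDs and also returns an attention mask.

```python
def toy_tokenize(sentences, vocab, max_length):
    """Stand-in tokenizer: one integer ID per whitespace-separated token."""
    input_ids = []
    for sentence in sentences:
        # Assign the next free ID to every token not seen before.
        ids = [vocab.setdefault(token, len(vocab)) for token in sentence.split()]
        input_ids.append(ids[:max_length])  # plays the role of truncation=True
    return {"input_ids": input_ids}

def toy_preprocess(examples, max_length=4):
    src_vocab, tgt_vocab = {}, {}
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["hi"] for ex in examples["translation"]]
    model_inputs = toy_tokenize(inputs, src_vocab, max_length)
    labels = toy_tokenize(targets, tgt_vocab, max_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# A batch has one "translation" entry per example, each a dict with "en" and "hi".
batch = {
    "translation": [
        {"en": "good morning", "hi": "सुप्रभात"},
        {"en": "how are you doing today", "hi": "आज आप कैसे हैं"},
    ]
}
features = toy_preprocess(batch)
print(features)
```

The second English sentence has five tokens, so it is truncated to the four allowed by max_length, while the labels keep one list of target IDs per example.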

We apply preprocess_function to every example in the validation and test splits using the map function.

tokenized_datasets_validation = dataset['validation'].map(
    preprocess_function,
    batched=True,
    batch_size=2)
tokenized_datasets_test = dataset['test'].map(
    preprocess_function,
    batched=True,
    batch_size=2)

5. Define the data collator

from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
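
The data collator turns a list of tokenized examples of different lengths into one rectangular batch. By default it pads the labels with -100, a value the loss function ignores. The plain-Python sketch below mimics that behaviour for illustration only; pad_batch is a hypothetical helper, right-side padding is assumed, and the real collator additionally builds tensors and decoder inputs.

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in the batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_input_len - n_in))
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * n_in + [0] * (max_input_len - n_in))
        # Padded label positions get -100 so they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_label_len - n_lab))
    return batch

toy_features = [
    {"input_ids": [5, 6, 7], "labels": [11, 12]},
    {"input_ids": [8], "labels": [13, 14, 15, 16]},
]
padded = pad_batch(toy_features, pad_token_id=0)
print(padded)
```

Padding per batch rather than to a fixed max_length keeps batches as short as their longest example, which saves computation during training.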

6. Model training parameters

Since our model is already trained for English-to-Hindi translation, we will freeze some of its layers. The model has 6 layers in its encoder and 6 in its decoder. We will freeze the first four layers of the encoder and of the decoder and train only the last two layers of each.

# Start with every parameter trainable
for parameter in model.parameters():
    parameter.requires_grad = True
# Number of layers to keep trainable at the end of each stack; earlier layers are frozen
num_layers_to_train = 2  # Adjust as needed
encoder_layers = model.model.encoder.layers
for layer_index, layer in enumerate(encoder_layers):
    if layer_index < len(encoder_layers) - num_layers_to_train:
        for parameter in layer.parameters():
            parameter.requires_grad = False
decoder_layers = model.model.decoder.layers
for layer_index, layer in enumerate(decoder_layers):
    if layer_index < len(decoder_layers) - num_layers_to_train:
        for parameter in layer.parameters():
            parameter.requires_grad = False
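
A quick way to confirm that freezing had the intended effect is to count the parameters that still have requires_grad set to True. The sketch below does this on a toy stack of six linear layers standing in for the encoder layers; freeze_all_but_last and count_parameters are hypothetical helper names introduced for this example.

```python
import torch.nn as nn

def freeze_all_but_last(layers, num_trainable):
    """Freeze every layer except the last `num_trainable` ones."""
    for index, layer in enumerate(layers):
        if index < len(layers) - num_trainable:
            for parameter in layer.parameters():
                parameter.requires_grad = False

def count_parameters(module):
    """Return (trainable, total) parameter counts for a module."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return trainable, total

# Toy stand-in for a 6-layer encoder: each Linear(4, 4) has 16 weights + 4 biases.
toy_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(6)])
freeze_all_but_last(toy_layers, num_trainable=2)

trainable, total = count_parameters(toy_layers)
print(f"{trainable} of {total} parameters are trainable")
```

Calling count_parameters(model) on the translation model after the freezing loops gives the same kind of summary. Parameters outside the encoder and decoder layer lists, such as the embeddings, are not touched by those loops and remain trainable.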

7. Model evaluation

import evaluate
metric = evaluate.load("sacrebleu")
import numpy as np
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
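
The sacrebleu metric loaded above reports a BLEU score, which is built from clipped n-gram precisions between each prediction and its reference, combined over several n-gram orders together with a brevity penalty. The sketch below computes only the clipped precision for a single n, using plain whitespace tokenization. It is a simplified illustration with hypothetical helper names, not sacreBLEU's implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n):
    """Share of predicted n-grams that also occur in the reference.

    Each predicted n-gram is credited at most as many times as it
    appears in the reference (this is the "clipping").
    """
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0

prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
p1 = clipped_precision(prediction, reference, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(prediction, reference, 2)  # 3 of 5 bigrams match
print(p1, p2)
```

A single wrong word lowers the unigram precision slightly but removes two matching bigrams, which is why higher-order n-grams make the score sensitive to word order and fluency.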

8. Model training

from transformers import Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(

We start training using the code below.

from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(


Step    Training Loss
500     2.920800
1000    2.555000
1500    2.437100
2000    2.389700

Let us build a Gradio app for our model and see how it works.

This will launch an interface that can be used for demo purposes.

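import gradio as gr
import torch
# Run on the GPU when one is available, otherwise on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
def translate(text):
  inputs = tokenizer(text, return_tensors="pt").to(device)
  translated_tokens = model.generate(**inputs, max_length=256)
  results = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
  return results
# Creating the user interface
# outputs='text' is an assumption: the translation is shown as plain text
interface = gr.Interface(fn=translate, inputs=gr.Textbox(lines=2, placeholder='Text to translate'), outputs='text')
# Launching the interface
interface.launch()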


Gradio Interface


In this article, we saw how to load a pretrained model from Hugging Face and fine-tune it on a specific dataset for machine translation. Readers are encouraged to apply the above steps to their own datasets and to experiment with the hyperparameters. To get reasonable output, it is recommended to use a large dataset and train for a sufficient number of epochs, which in practice requires access to a GPU.

