Transfer Learning with Fine-tuning in NLP

Transfer learning involves taking a model that was pre-trained on one task and applying its learned knowledge to a different but related task. The basic idea is that the features the pre-trained model learned from a large dataset generalize and remain useful for other tasks, even when the new task uses a different dataset. The process typically involves taking a pre-trained model, removing its last layers, and replacing them with new layers. The initial layers, which capture general features, are either kept fixed or updated with a small learning rate so that the learned representations are preserved. The newly added layers are then trained on the dataset specific to the target task.
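
The head-replacement step described above can be sketched with a toy PyTorch network. This is only an illustration under stated assumptions: the layer sizes, the two-class target task, and the choice to freeze the retained layers completely (rather than update them with a small learning rate) are all hypothetical, and the "pre-trained" network here is just a randomly initialized stand-in.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained network: feature layers
# followed by the head that was used for the original task.
pretrained = nn.Sequential(
    nn.Linear(16, 32),   # general-purpose feature layer
    nn.ReLU(),
    nn.Linear(32, 10),   # original task head (10 classes, assumed)
)

# Keep everything except the last layer and freeze it.
backbone = nn.Sequential(*list(pretrained.children())[:-1])
for param in backbone.parameters():
    param.requires_grad = False

# Attach a new head for the target task (2 classes, assumed).
new_head = nn.Linear(32, 2)
model = nn.Sequential(backbone, new_head)

# Only the parameters of the new head remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Only the new head's weight and bias show up as trainable, so an optimizer built from the trainable parameters would leave the retained layers untouched.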

Fine-tuning refers to the process of taking a pre-trained model and further training it on a new dataset. It involves training the entire model, including the initial layers. The learning rate for the initial layers is set to a small value to prevent significant changes, while the later layers use a higher learning rate to adapt to the new dataset.
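
One way to give early and later layers different learning rates is to pass separate parameter groups to the optimizer. The sketch below uses a toy two-part model, not BERT; the names early_layers and later_layers and the learning rates 1e-5 and 1e-3 are illustrative assumptions.

```python
import torch
from torch import nn
from torch.optim import AdamW

torch.manual_seed(0)

# Hypothetical two-part model: "early" layers and "later" layers.
early_layers = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
later_layers = nn.Linear(32, 2)
model = nn.Sequential(early_layers, later_layers)

# One parameter group per part, each with its own learning rate.
# The values 1e-5 and 1e-3 are illustrative assumptions.
optimizer = AdamW([
    {"params": early_layers.parameters(), "lr": 1e-5},
    {"params": later_layers.parameters(), "lr": 1e-3},
])

# One training step on random data; both groups are updated,
# but the early layers move far less per step.
inputs = torch.randn(8, 16)
targets = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

learning_rates = [group["lr"] for group in optimizer.param_groups]
print(learning_rates)
```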

Both transfer learning and fine-tuning are widely used in natural language processing. They offer practical solutions to overcome limitations posed by small datasets and allow for the efficient development of deep learning models with improved performance.


Transfer Learning with Fine Tuning for a deep learning model

To understand transfer learning with fine-tuning on an NLP model, let us consider BERT. BERT (Bidirectional Encoder Representations from Transformers) is a language representation model developed by Google. It is designed to capture contextual relationships and meanings of words within sentences or documents. In this example, the BERT model is fine-tuned on a custom task using a small sample of input texts and corresponding labels; the training texts and the test texts are defined explicitly as separate lists. The input_texts list contains two examples, and the labels list contains their corresponding labels (1 for positive, 0 for negative). This example shows how to classify positive and negative comments.

Importing Libraries and Dataset

In the code, ‘tokenizer’ refers to the BERT tokenizer. The tokenizer is responsible for converting input text into numerical representations that can be understood by the BERT model.


!pip install transformers torch
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

Importing the torch module gives access to the PyTorch functionality used below. AdamW serves as the optimizer for fine-tuning the BERT model; it is imported from torch.optim because the AdamW class in transformers is deprecated in favor of the PyTorch implementation. BertTokenizer and BertForSequenceClassification are imported from transformers. Next, we specify the name of the pre-trained BERT checkpoint, which will be loaded and then fine-tuned on our own dataset.


pretrained_model_name = 'bert-base-uncased'

Transfer Learning

The code starts by loading the pre-trained BERT model (bert-base-uncased) using the BertForSequenceClassification.from_pretrained() method. This model has been pre-trained on a large corpus to learn general language representations and contextualized word embeddings. By loading this pre-trained model, we are leveraging the knowledge and insights gained from its pre-training.


tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)
model = BertForSequenceClassification.from_pretrained(pretrained_model_name)

Tokenize and Encode the Data

Tokenization is a common step in NLP that helps in preparing the text data for further processing. The code utilizes the BERT tokenizer to tokenize the input texts. The tokenizer splits the text into tokens and performs additional tasks such as adding special tokens, truncating or padding the sequences to a fixed length and generating attention masks.
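
To make the roles of special tokens, truncation, padding, and attention masks concrete, here is a simplified pure-Python sketch. It is not the BERT tokenizer: the vocabulary and token IDs are hypothetical, and whitespace splitting stands in for BERT's WordPiece subword tokenization.

```python
# Toy vocabulary with hypothetical IDs (the real BERT vocabulary is far larger).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "this": 4, "is": 5, "a": 6, "positive": 7, "review": 8}

def encode(text, max_length):
    """Lowercase, split on whitespace, add special tokens, truncate, pad."""
    words = text.lower().replace(".", "").split()
    ids = [vocab.get(word, vocab["[UNK]"]) for word in words]
    ids = ids[: max_length - 2]              # leave room for [CLS] and [SEP]
    ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    attention_mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_length - len(ids)
    ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding          # 0 marks padding positions
    return ids, attention_mask

ids, mask = encode("This is a positive review.", max_length=10)
print(ids)   # [2, 4, 5, 6, 7, 8, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions hold real tokens, so that padding does not influence the result.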


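input_texts = ['This is a positive review.',
               'This is a negative review.']
labels = [1, 0]
input_ids = []
attention_masks = []
for text in input_texts:
    encoded_dict = tokenizer.encode_plus(
        text,
        add_special_tokens=True,   # add [CLS] and [SEP]
        max_length=64,             # assumed value; the original is not shown
        padding='max_length',      # pad every sequence to max_length
        truncation=True,
        return_tensors='pt',
    )
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)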

Fine Tuning the BERT Model

Fine-tuning refers to the process of adapting the pre-trained BERT model to a specific downstream task. Fine-tuning involves training the BERT model on a task-specific dataset with labeled examples.

After loading the pre-trained BERT model (BertForSequenceClassification.from_pretrained()), the optimizer (AdamW) is defined. The loss is cross-entropy, which the model computes internally when labels are passed to it. The model is put into training mode using model.train(), which enables training-time behavior such as dropout. The training loop runs for a specified number of epochs. In each epoch, the loop iterates over the dataset in batches of batch_size examples. Within the loop, optimizer.zero_grad() clears any previous gradients, the forward pass returns the loss, loss.backward() computes the gradients, and optimizer.step() updates the parameters.


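batch_size = 2
epochs = 3
optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()  # training mode (enables dropout)
for epoch in range(epochs):
    for i in range(0, input_ids.size(0), batch_size):
        batch_input_ids = input_ids[i:i+batch_size]
        batch_attention_masks = attention_masks[i:i+batch_size]
        batch_labels = labels[i:i+batch_size]
        optimizer.zero_grad()  # clear gradients from the previous step
        outputs = model(
            batch_input_ids,
            attention_mask=batch_attention_masks,
            labels=batch_labels,  # passing labels makes the model return a loss
        )
        loss = outputs.loss
        loss.backward()   # compute gradients
        optimizer.step()  # update the parameters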
Model Predictions

‘tokenizer.encode_plus()’ is a method provided by the Hugging Face Transformers library’s tokenizer class. It is used to tokenize and encode a given input text or pair of texts into numerical representations that can be understood by the BERT model. Use the fine-tuned BERT model for predictions:


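test_texts = ['This is another review.',
              'I am not sure about this.']
test_input_ids = []
test_attention_masks = []
model.eval()  # evaluation mode (disables dropout)
for text in test_texts:
    encoded_dict = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=64,             # assumed value; the original is not shown
        padding='max_length',
        truncation=True,
        return_tensors='pt',
    )
    test_input_ids.append(encoded_dict['input_ids'])
    test_attention_masks.append(encoded_dict['attention_mask'])
test_input_ids = torch.cat(test_input_ids, dim=0)
test_attention_masks = torch.cat(test_attention_masks, dim=0)
with torch.no_grad():  # no gradients are needed for inference
    outputs = model(test_input_ids, attention_mask=test_attention_masks)
predicted_labels = torch.argmax(outputs.logits, dim=1)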
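
The argmax over the logits simply picks the class with the highest score for each text. The pure-Python sketch below shows this on made-up logits (hypothetical values, not model output), along with the softmax probability of the chosen class as a rough confidence.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for two test texts (columns: class 0, class 1).
logits = [[-0.4, 0.7], [0.1, 0.3]]
predicted = [argmax(row) for row in logits]
confidence = [max(softmax(row)) for row in logits]
print(predicted)  # [1, 1]
```

In this made-up case both texts receive label 1, but the second prediction has a probability close to 0.5, so it is much less certain than the first.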

Print each test text alongside its predicted label:


for text, label in zip(test_texts, predicted_labels):
    print(f'Text: {text}\nPredicted Label: {label.item()}\n')


Text: This is another review.
Predicted Label: 1

Text: I am not sure about this.
Predicted Label: 1

The steps above demonstrate the overall workflow of transfer learning with fine-tuning using the pre-built BERT model, and the output above shows the labels predicted for the two test texts. Because the model was fine-tuned on only two examples, these particular predictions are illustrative only.

In general, a fine-tuned BERT model is expected to outperform a model trained from scratch on the same task, for the following reasons:

  • Pre-training on Large-Scale Data: The pre-trained BERT model has been trained on a massive amount of text data, such as Wikipedia articles, to learn general language representations. This pre-training allows the model to capture a deep understanding of language patterns and semantics, which can be beneficial for a wide range of NLP tasks.
  • Transfer of Knowledge: By fine-tuning the pre-trained BERT model on a specific task with a smaller labeled dataset, the model can leverage the knowledge and representations learned during pre-training. The pre-trained model has already learned useful features and linguistic patterns, which can be transferable to the target task. This transfer of knowledge helps the fine-tuned model perform better compared to training a model from scratch on the same task.
  • Generalization Capability: The fine-tuned BERT model has the ability to generalize well to new, unseen data. This is because the model has been exposed to diverse language patterns during pre-training and fine-tuning. As a result, the model can capture the nuances and context of the input texts, leading to more accurate predictions on new examples.
  • Capturing Task-Specific Information: During the fine-tuning process, the BERT model is adapted to the specific task by updating its parameters on the task-specific dataset. This allows the model to learn task-specific patterns, features, and decision boundaries, further enhancing its predictive capabilities.

Overall, a fine-tuned BERT model is expected to outperform a model trained from scratch because it benefits from pre-training on large-scale data, transfer of knowledge, generalization capability, and task-specific adaptation. The fine-tuning process allows the model to reuse pre-trained language representations and apply them to specific NLP tasks, which typically improves performance and prediction accuracy.

In conclusion, transfer learning with fine-tuning in Natural Language Processing (NLP) is a powerful technique that leverages pre-trained models to enhance the performance of specific NLP tasks.

Last Updated : 11 Jul, 2023