- Do you want to achieve state-of-the-art results in your next NLP project?
- Is your data insufficient for training the machine learning model?
- Do you want to improve the accuracy of your machine learning model with some extra data?
If yes, all you need is data augmentation. Whether you are building a text classification, summarization, question answering, or any other machine learning model, data augmentation can help improve its performance.
Common data augmentation techniques for text include:
- Word Embeddings
- Back Translation
- Text to Text Transfer Transformer
- Ensemble Approach
Text to Text Transfer Transformer:
The Text to Text Transfer Transformer (T5) is a large transformer model trained on the Colossal Clean Crawled Corpus (C4) dataset. Google open-sourced a pre-trained T5 model that can perform multiple tasks such as translation, summarization, question answering, and classification.
T5 reframes every NLP task into a text-to-text format.
- Example 1: To train T5 for English-to-German translation, the input is the prefix “translate English to German” followed by the English text, and the output is the German text.
- Example 2: To train the model for sentiment classification, the input is the prefix “sentiment classification” followed by the input text, and the output is the sentiment.
The same model can be trained for multiple tasks by specifying a different task-specific prefix string in the input training data. T5 achieved state-of-the-art results on a variety of NLP tasks.
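The text-to-text framing can be sketched in a few lines of plain Python. This is only an illustration: the helper name `to_text_to_text`, the separator between prefix and text, and the sample sentences are assumptions, and the exact prefix strings depend on how a given model was trained.

```python
def to_text_to_text(prefix, input_text, target_text):
    """Return one training example as an (input string, target string) pair."""
    # The task is encoded only in the prefix; input and output are both plain text.
    return f"{prefix}: {input_text}", target_text


# Two different tasks expressed in the same format (sample sentences are illustrative).
examples = [
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    to_text_to_text("sentiment classification", "I loved this movie.", "positive"),
]

for model_input, model_target in examples:
    print(model_input, "->", model_target)
```

Because every task shares this shape, one model and one training loop can serve all of them.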
How to use this model for data augmentation?
This can be done in multiple ways.
In Back Translation, we used pre-trained models out of the box. If we want to use T5 out of the box, we can make use of its text summarization capabilities for data augmentation.
- T5 can take input in the format “summarize: <input text>” and generate a summary of the input.
- T5 performs abstractive summarization.
- It can rephrase sentences or use new words when generating the summary.
- This data augmentation technique is useful for NLP tasks involving long text documents.
For short texts, it may not give very good results.
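A minimal sketch of summarization-based augmentation, assuming the Hugging Face Transformers library and the public `t5-small` checkpoint are available. The function names, the beam setting, and the length limits are illustrative choices, not values from this article.

```python
def build_summarization_input(text):
    """Collapse whitespace and prepend the 'summarize: ' task prefix."""
    return "summarize: " + " ".join(text.split())


def summarize(text, model_name="t5-small", max_new_tokens=60):
    """Generate an abstractive summary with a pre-trained T5 checkpoint."""
    # Imported lazily so the helper above works without the heavy dependency.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(
        build_summarization_input(text),
        return_tensors="pt",
        truncation=True,
        max_length=512,  # long documents are truncated to the model's input limit
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `summarize(long_document)` returns a shorter text that can be added to the training set with the same label as the original document.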
Another way to use T5 for data augmentation is to apply transfer learning and use the knowledge stored inside T5 to generate synthetic data. This can be done in multiple ways.
1) One way is to fine-tune T5 on a masked word prediction task, the same task on which BERT is trained.
We can use the same C4 dataset on which T5 was pre-trained to further fine-tune it for masked word prediction. The input to the model starts with a prefix such as “predict mask”, followed by an input sentence containing a masked word, and the output is the original sentence without the mask.
We can also mask multiple words in the same sentence and train T5 to predict the span of words. If we mask a single word, the model will not be able to generate new data that has variations in sentence structure. But if we mask multiple words, the model can learn to generate data with slight variations in sentence structure as well. This way, our data augmentation approach will be very similar to the BERT-based approach.
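The construction of such training pairs can be sketched as follows. The helper name `make_masked_example`, the `<mask>` placeholder, and the “predict mask” prefix are assumptions for illustration; a real setup would use whatever mask token and prefix the fine-tuning pipeline expects.

```python
def make_masked_example(sentence, start, length, prefix="predict mask", mask_token="<mask>"):
    """Replace a span of `length` words beginning at word index `start` with one
    mask token and return (model input, target), where the target is the
    original sentence."""
    words = sentence.split()
    if length < 1 or start < 0 or start + length > len(words):
        raise ValueError("mask span must lie inside the sentence")
    masked = words[:start] + [mask_token] + words[start + length:]
    return f"{prefix}: {' '.join(masked)}", sentence


# Masking a two-word span leaves the model room to vary the sentence structure.
model_input, target = make_masked_example("The quick brown fox jumps", start=1, length=2)
print(model_input)
print(target)
```

Setting `length=1` gives the single-word case, while larger spans correspond to the multi-word masking described above.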
2) Another way to use T5 for data augmentation is to fine-tune it on a paraphrase generation task.
Paraphrasing means generating an output sentence that has the same meaning as the input but a different sentence structure and different keywords. This is exactly what we need for data augmentation.
We are going to use the PAWS dataset to fine-tune T5 for paraphrase generation. PAWS stands for Paraphrase Adversaries from Word Scrambling. This dataset contains thousands of paraphrases and is available in six languages other than English.
So, our data augmentation approach using T5 will be as follows:
Step 1: Preprocess the data to convert the PAWS dataset into the format required for training T5.
Step 2: Fine-tune T5. For fine-tuning, the input to the model will be in the format “generate paraphrase: <input text>”, and the output will be a paraphrase of the input text.
Once we have a fine-tuned model, we can use it to generate paraphrases of any input text. We can provide input with the prefix “generate paraphrase”, and the model will output its paraphrase.
The model can be configured to output multiple paraphrases. This way we can very easily create our own paraphrase generation model for data augmentation.
The pre-trained T5 model is available in five different sizes.
- T5 Small (60M Params)
- T5 Base (220M Params)
- T5 Large (770M Params)
- T5 3B (3B Params)
- T5 11B (11B Params)
Larger models give better results but also require more computing power and take much longer to train. However, fine-tuning is a one-time process. Once you have a good-quality fine-tuned paraphrase generation model trained on an appropriate dataset, it can be used for data augmentation in several NLP tasks.
Implementation of Data Augmentation using T5
We are going to implement data augmentation with T5 using the Simple Transformers library. This library is built on the Hugging Face Transformers library and makes it simple to fine-tune transformer-based models.
- Step 1: We’re going to upload the PAWS dataset (Paraphrase Adversaries from Word Scrambling) that we need for fine-tuning.
- Step 2: We need to prepare the dataset for training so that we can start fine-tuning the model.
- Step 3: We will create the fine-tuned model and save it on Google Drive.
- Step 4: Finally, we will load the saved model and generate paraphrases that can be used for data augmentation.
1) Install Dependencies
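Simple Transformers is typically installed with `pip install simpletransformers`. The short check below is a sketch for confirming that the packages this walkthrough relies on can be imported; the package list is an assumption based on the steps that follow, so adjust it to your environment.

```python
import importlib.util

# Assumed requirements for this walkthrough; edit the list as needed.
REQUIRED_PACKAGES = ["simpletransformers", "pandas", "torch"]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages(REQUIRED_PACKAGES))
```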
2) Prepare Dataset for training
You can download the PAWS Wiki labeled dataset. It has three files: train.tsv, dev.tsv, and test.tsv. We are only going to use the train and dev files. This dataset has three columns: sentence1, sentence2, and label. The label is 1 if the two sentences are paraphrases and 0 otherwise.
For example, sentence1 reads “When comparable rates of flow can be maintained, the results are high.” and the corresponding sentence2 reads “The results are high when comparable flow rates can be maintained.” These two sentences are paraphrases, so the label is 1.
We’re going to fine-tune T5 for paraphrase generation, so we need only the paraphrases from this dataset. This means only the samples with label 1 are useful for our task.
The dataset contains 49,401 samples.
Let’s keep only the pairs that have label 1.
Now the size is reduced to almost half. T5 can be trained for multiple tasks, so when giving input to the model, we need to add a task-specific prefix. For that, we add a new column, prefix, to our data frame with the value “generate paraphrase”.
We also need to rename the sentence1 and sentence2 columns. Note that the new names must be “input_text” and “target_text”; otherwise training will fail with a runtime error.
We need to apply the same steps to dev.tsv.
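The preparation steps can be sketched with the standard library alone. This is not the original notebook code: the function names are hypothetical, the file is assumed to be tab-separated with `sentence1`, `sentence2`, and `label` header columns, and the prefix value is assumed to be “generate paraphrase”.

```python
import csv

PREFIX = "generate paraphrase"  # assumed task prefix; must match the one used at prediction time


def load_paraphrase_pairs(tsv_file, prefix=PREFIX):
    """Read a PAWS-style TSV from an open text file and keep only label-1 pairs.

    Each returned row already uses the column names expected for T5 training:
    prefix, input_text, target_text.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if record["label"].strip() != "1":
            continue  # drop pairs that are not paraphrases
        rows.append({
            "prefix": prefix,
            "input_text": record["sentence1"],
            "target_text": record["sentence2"],
        })
    return rows


def to_dataframe(rows):
    """Convert the rows to a pandas data frame (requires pandas)."""
    import pandas as pd

    return pd.DataFrame(rows, columns=["prefix", "input_text", "target_text"])
```

The same function can be applied to both files, for example by opening train.tsv and dev.tsv with `open(path, encoding="utf-8", newline="")` and passing each file object to `load_paraphrase_pairs`.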
3) Fine-tune T5 for paraphrase generation
First, we need to decide on some configuration parameters. You can go through the Simple Transformers documentation to understand all of these parameters.
The last parameter determines how many paraphrases to generate for every input. The paraphrases are generated using a combination of top-k sampling and top-p (nucleus) sampling.
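To make the sampling strategy concrete, here is a simplified, library-independent sketch of how top-k and top-p filtering narrow down the candidates for the next token before one is sampled. It is an illustration under stated assumptions, not the code that Simple Transformers or Hugging Face Transformers runs internally; the function names and the toy probabilities are made up.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalized cumulative probability reaches top_p. Returns renormalized
    probabilities for the surviving tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {token: p / kept_total for token, p in kept}


def sample_token(probs, top_k, top_p, rng=random):
    """Sample one token from the filtered distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (illustrative values only).
next_token_probs = {"high": 0.5, "good": 0.3, "strong": 0.15, "purple": 0.05}
print(top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.8))
```

Because the final choice is random, repeated generation for the same input can yield different paraphrases, which is what makes several outputs per sentence useful for augmentation.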
To create an object of the T5 model class, we need to pass the configuration parameters and the type of T5 model. T5 is available in multiple sizes; we’re going to use the T5 Small version.
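A minimal sketch of the configuration and training call with Simple Transformers might look like this. All values in `MODEL_ARGS` are illustrative assumptions rather than the settings used in the original notebook, and `fine_tune` is a hypothetical wrapper name. The data frames are expected to have the columns prefix, input_text, and target_text.

```python
# Illustrative settings only; tune them for your data and hardware.
MODEL_ARGS = {
    "max_seq_length": 128,
    "train_batch_size": 8,
    "num_train_epochs": 2,
    "evaluate_during_training": True,
    "overwrite_output_dir": True,
    "output_dir": "outputs/",
    "best_model_dir": "outputs/best_model",
    # Generation settings used when predicting.
    "max_length": 128,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 3,  # how many paraphrases to generate per input
}


def fine_tune(train_df, eval_df, model_args=None, model_name="t5-small"):
    """Create a T5 model and fine-tune it on paraphrase pairs."""
    # Imported lazily so the configuration can be inspected without the libraries.
    import torch
    from simpletransformers.t5 import T5Model

    args = MODEL_ARGS if model_args is None else model_args
    model = T5Model("t5", model_name, args=args, use_cuda=torch.cuda.is_available())
    model.train_model(train_df, eval_data=eval_df)
    return model
```

Swapping `model_name` for a larger checkpoint is the only change needed to try a bigger T5 variant, at the cost of more memory and training time.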
Saving the best model to Google Drive:
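One way to do this, sketched under assumptions: in Google Colab, Drive is usually mounted first with `from google.colab import drive` followed by `drive.mount("/content/drive")`, after which the model folder can be copied like any other directory. The helper name `copy_model_dir` and the paths in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def copy_model_dir(best_model_dir, destination_dir):
    """Copy the saved model directory to another location and return the new path."""
    source = Path(best_model_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"model directory not found: {source}")
    destination = Path(destination_dir)
    # dirs_exist_ok (Python 3.8+) lets a re-run overwrite an earlier copy.
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return destination
```

For example, `copy_model_dir("outputs/best_model", "/content/drive/MyDrive/t5_paraphrase")` would place the model in a Drive folder, assuming those are the directories in use.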
4) Generate paraphrases by passing the prefix “Generate Paraphrase for this line”
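A minimal sketch of this step with Simple Transformers, assuming the fine-tuned model was saved to a local directory. The function names and the default directory are hypothetical, and the prefix passed in should be the same one used during fine-tuning.

```python
def with_prefix(prefix, sentences):
    """Format each sentence as '<prefix>: <sentence>' for prediction."""
    return [f"{prefix}: {sentence}" for sentence in sentences]


def generate_paraphrases(sentences, prefix, model_dir="outputs/best_model", use_cuda=False):
    """Load a fine-tuned T5 model from `model_dir` and paraphrase each sentence."""
    # Imported lazily so the formatting helper works without the library.
    from simpletransformers.t5 import T5Model

    model = T5Model("t5", model_dir, use_cuda=use_cuda)
    # The shape of the result depends on the generation settings stored with the model.
    return model.predict(with_prefix(prefix, sentences))
```

Each generated paraphrase can then be added to the training set with the same label as the sentence it was derived from.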
In some cases, the model may not give good results, but there is a lot of scope for improvement. Instead of using the T5 Small version, we can use a larger version and fine-tune it on a larger dataset to get much better results.
You can also go to the Hugging Face model repository and search for T5. You may find T5 models already fine-tuned for paraphrase generation. You can try out these models or further fine-tune them on your domain-specific dataset.
This flexibility is an advantage of this data augmentation technique.