Text augmentation techniques in NLP

Last Updated : 03 Jan, 2024

Text augmentation is an important aspect of NLP used to generate an artificial corpus. It helps NLP-based models generalize better across many different sub-tasks such as intent classification, machine translation, chatbot training, and text summarization.

Text augmentation is used when:

  • There is an absence of sufficient variation in the text corpus.
  • There is a high data imbalance during intent classification tasks.
  • The overall quantity of data is insufficient for data-hungry machine-learning models.  

Data Augmentation in NLP

Data Augmentation (DA) is a technique employed to artificially expand training datasets by creating various versions of existing data without the need for additional data collection. Its primary goal is to enhance classification task performance by altering the data while preserving class categories.

In fields like Computer Vision and Natural Language Processing (NLP), where data scarcity and limited diversity pose challenges, data augmentation strategies play a crucial role. While generating augmented images is relatively straightforward, NLP faces complexities due to the nuanced nature of language. Unlike images, we can’t replace every word with a synonym, and even if substitution is possible, maintaining context becomes a significant challenge.

The pivotal role of data augmentation lies in its ability to boost model performance by increasing the volume of training data. More data generally leads to improved model performance. However, it is crucial to strike a balance in the distribution of augmented data. The generated data should neither closely resemble the original data nor deviate too much from it, as either extreme may result in overfitting or poor model performance. Therefore, effective data augmentation approaches should aim for a harmonious balance in data distribution.


Data augmentation techniques are applied at the following three levels:

  • Word Level: At the word level, augmentation involves the transformation or replacement of individual words within the text. This could include synonyms, word shuffling, or other word-level modifications aimed at introducing variability without compromising the overall meaning or context of the text.
  • Sentence Level: Augmentation at the sentence level focuses on altering the structure and composition of entire sentences. Techniques such as paraphrasing, sentence shuffling, or introducing grammatical variations are employed. The goal is to diversify the dataset by presenting the model with different formulations of ideas while maintaining the essence of the original content.
  • Document Level: At the document level, augmentation extends to the entire document or piece of text. This may involve introducing substantial changes, such as inserting or removing paragraphs, reordering sections, or even changing the overall writing style. Document-level augmentation aims to provide the model with exposure to varied document structures and writing conventions.

Easy Data Augmentation (EDA)

Synonym Replacement

Synonym replacement is a technique that replaces one or more tokens in a text/document with tokens of equal or equivalent meaning, without changing the overall meaning or context of the original phrase. The sentence generated after replacement is called a synthetic phrase.

Synonym replacement can occur at any or all three levels of a document, i.e. at the character level, word/token level, or sentence level.

Word-embedding based

Pre-trained word embeddings, such as GloVe, Word2Vec, and fastText, can be used to find the word vector that is closest to the original word in the embedding space.

Because contextual bidirectional embeddings such as ELMo and BERT produce substantially richer vector representations, they are recommended for improved reliability. Bidirectional LSTM and Transformer models are more effective because they can encode longer text sequences and capture contextual knowledge of the surrounding words.
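
A minimal sketch of this idea (assuming the gensim library and its downloadable glove-wiki-gigaword-100 vectors): replacement candidates are simply the nearest neighbours of a word in the embedding space.

Python3

import gensim.downloader as api

# Load pre-trained 100-dimensional GloVe vectors (downloads on first use)
model = api.load("glove-wiki-gigaword-100")

# Words closest to "quick" in the embedding space are replacement candidates
print(model.most_similar("quick", topn=3))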

Lexical-based

Tokens with close or exact meaning are replaced directly in the string to produce lexical-based augmented text. WordNet is used to tag and find the respective synonyms, hyponyms, and meronyms for the tokens in the text.
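
A minimal sketch of WordNet-based replacement with NLTK (the helper name synonym_replace is our own):

Python3

import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_replace(word):
    # Gather every WordNet lemma for the word, excluding the word itself
    synonyms = {
        lemma.name().replace("_", " ")
        for syn in wordnet.synsets(word)
        for lemma in syn.lemmas()
        if lemma.name().lower() != word.lower()
    }
    return random.choice(sorted(synonyms)) if synonyms else word

print(synonym_replace("example"))  # e.g. 'illustration' or 'instance'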

  • Random Insertion

Randomly inserting semantically similar words/tokens (excluding stopwords) can enrich a sentence/embedding or keep it semantically unchanged, resulting in the creation of augmented text. This step should be done carefully, as introducing a negation word can completely change a sentence's sentiment.

original text: This is a geekforgeeks example.
post-random insertion text: This is a new geekforgeeks example.
  • Random Swap

Randomly swapping the positions of two words/tokens, while keeping the semantic meaning of the sentence/embedding largely intact, results in the creation of augmented text.

original text: This is a geekforgeeks example.
post-random swap text: This is a example geekforgeeks.
  • Random Deletion

Randomly deleting words that are not essential to the semantic meaning of the sentence/embedding results in the creation of augmented text. A minimal sketch of all three operations follows the example below.

original text: This is kind of a geekforgeeks example.
post-random deletion text: This is a geekforgeeks example.
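
As referenced above, here is a minimal sketch of random insertion, random swap, and random deletion over whitespace tokens (the inserted word is assumed to come from a synonym lookup such as the WordNet helper earlier; the helper names are our own):

Python3

import random

def random_insertion(words, candidate):
    # Insert a pre-chosen, semantically similar word at a random position
    out = words[:]
    out.insert(random.randrange(len(out) + 1), candidate)
    return out

def random_swap(words):
    # Swap the positions of two randomly chosen words
    out = words[:]
    i, j = random.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p=0.2):
    # Drop each word independently with probability p (keep at least one)
    out = [w for w in words if random.random() > p]
    return out if out else [random.choice(words)]

tokens = "This is a geekforgeeks example".split()
print(" ".join(random_insertion(tokens, "new")))
print(" ".join(random_swap(tokens)))
print(" ".join(random_deletion(tokens)))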

Generative Models

Data-to-text generation

It is a natural language generation technique for creating artificial texts from unstructured data such as data tables, JSON documents, and knowledge graphs, or from structured relational databases such as PostgreSQL.

1. Soft Data-to-text generation

This technique uses a soft computing approach (a pre-trained language model such as BERT, RoBERTa, or BART) to create sentences from unstructured data. It requires supervised training to map each unique type of unstructured data to semantically meaningful sentences.

Complex and insightful sentences can be generated with this technique when a rich language model is used. Such techniques are used in large language models to create conversational AIs similar to ChatGPT.

original text: Who is the president of the United States?
soft generated text: Joe Biden is currently the serving president of the United States of America.
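
A hedged sketch of this idea with an instruction-tuned seq2seq model (google/flan-t5-base is an assumption here; a production system would fine-tune a model on its own data-to-text pairs):

Python3

from transformers import pipeline

# Any instruction-tuned seq2seq checkpoint can stand in here
generator = pipeline("text2text-generation", model="google/flan-t5-base")

record = {"name": "Joe Biden", "role": "president", "country": "United States"}
prompt = "Write one sentence describing this record: " + str(record)

print(generator(prompt, max_new_tokens=40)[0]["generated_text"])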

2. Hard Data-to-text generation

This technique uses custom algorithms to create meaningful strings by preprocessing the unstructured/structured data. It relies on understanding the existing dataset completely, making manual inferences, and producing the corresponding text. The complexity of the generated text depends on the developer's insight into the data.

This technique is usually used to verbalize data so that it is user-understandable in data warehouses, diary-entry datasets, etc., where the end user has direct read access to it.

original text: 20/03/2023 - 04/04/2023
hard generated text: From the 20th of March to the 4th of April.
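
A minimal sketch of such a hand-written rule for the date-range example above (the helper names are our own):

Python3

from datetime import datetime

def ordinal(n):
    # 1 -> 1st, 2 -> 2nd, 3 -> 3rd, 11-13 -> 11th-13th, otherwise n -> nth
    suffix = "th" if 11 <= n % 100 <= 13 else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def date_range_to_text(span):
    # Parse "DD/MM/YYYY - DD/MM/YYYY" and verbalize it as a sentence
    start_s, end_s = (s.strip() for s in span.split("-"))
    start = datetime.strptime(start_s, "%d/%m/%Y")
    end = datetime.strptime(end_s, "%d/%m/%Y")
    return f"From the {ordinal(start.day)} of {start:%B} to the {ordinal(end.day)} of {end:%B}."

print(date_range_to_text("20/03/2023 - 04/04/2023"))
# From the 20th of March to the 4th of April.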

Text-to-text generation

1. Text summarization

This text generation technique produces summarized sentences from documents/articles, reducing longer text into shorter text while keeping the semantic meaning of the original. There are two approaches to this:

  1. Extractive method: This method uses a frequency-based approach to keep only those sentences containing the most frequent topic words (a minimal sketch follows the example below).
  2. Abstractive method: This is a more advanced and powerful method that makes use of language models such as bidirectional transformers (BERT) and GPT.
original text: This is a geekforgeeks example. It is part of a larger example and we can probably shorten it to create a new sentence.
summarised text: We can shorten this example to create a new sentence.
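
As referenced in the list above, a minimal sketch of frequency-based extractive summarization, scoring each sentence by the corpus frequency of its words (the helper name is our own):

Python3

import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    # Split into sentences, then score each by the frequency of its words
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )
    return " ".join(scored[:n_sentences])

doc = ("This is a geekforgeeks example. It is part of a larger example "
       "and we can probably shorten it to create a new sentence.")
print(extractive_summary(doc))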

2. Paraphrasing

Paraphrasing is itself an NLP subtask used to generate semantically coherent sentences with altered wording. It uses rule-based approaches, such as replacing POS-tagged words with synonyms of the same tag, as well as machine-learning-based approaches that rewrite the entire sentence without changing its meaning.

original text: This is a geekforgeeks example.
paraphrased text: This is an example from geekforgeeks.
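
A hedged sketch of the machine-learning approach (the checkpoint name tuner007/pegasus_paraphrase is an assumption; substitute any paraphrase-tuned seq2seq model):

Python3

from transformers import pipeline

paraphraser = pipeline("text2text-generation", model="tuner007/pegasus_paraphrase")

text = "This is a geekforgeeks example."
# Beam search lets us return several distinct paraphrase candidates
for out in paraphraser(text, num_beams=4, num_return_sequences=2, max_new_tokens=30):
    print(out["generated_text"])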

TextAttack library and Data Augmentation

TextAttack is a Python framework created especially for data augmentation, adversarial training, and adversarial attacks in the Natural Language Processing (NLP) domain. Only text data augmentation will be covered in this article.


All of these augmenters operate under the textattack framework. Its Augmenter classes are specifically designed for NLP data augmentation. The six types of augmenters are:

  • WordNetAugmenter
  • EmbeddingAugmenter
  • CharSwapAugmenter
  • CheckListAugmenter
  • EasyDataAugmenter
  • CLAREAugmenter

To install TextAttack:

!pip install textattack

WordNetAugmenter

The WordNetAugmenter in TextAttack utilizes WordNet, a lexical database of the English language, for word substitution-based data augmentation. It replaces words in a sentence with their synonyms or hypernyms, enhancing text diversity.

Python3




from textattack.augmentation import WordNetAugmenter
 
augmenter = WordNetAugmenter()
 
# Example usage:
sentence = "The quick brown fox jumps over the lazy dog."
augmented_sentence = augmenter.augment(sentence)
 
print(f"Original Sentence: {sentence}")
print(f"Augmented Sentence: {augmented_sentence}")


Output:

Original Sentence: The quick brown fox jumps over the lazy dog.
Augmented Sentence: ['The quick brown fox jumps over the lazy click.']

Embedding Augmenter

The EmbeddingAugmenter in TextAttack performs data augmentation by swapping words for neighbours in the embedding space, i.e. words whose vectors are close to the original word's. This augments the text while preserving semantic meaning.

Python3




from textattack.augmentation import EmbeddingAugmenter
 
# Initialize the EmbeddingAugmenter
embed_aug = EmbeddingAugmenter()
 
# Example usage:
original_text = "TextAttack is a powerful library for NLP."
augmented_text = embed_aug.augment(original_text)
 
print(f"Original Text: {original_text}")
print(f"Augmented Text: {augmented_text}")


Output:

Original Text: TextAttack is a powerful library for NLP.
Augmented Text: ['TextAttack is a emphatic library for NLP.']

CharSwapAugmenter

The CharSwapAugmenter in TextAttack is an augmentation technique that randomly perturbs the characters within a word (substituting, deleting, inserting, or swapping neighbouring characters) to introduce small, character-level perturbations. This can help improve the robustness of natural language processing models by simulating variations in the input text at the character level.

Python3




from textattack.augmentation import CharSwapAugmenter
 
# Initialize the CharSwapAugmenter
char_swap_augmenter = CharSwapAugmenter()
 
# Example usage:
original_text = "TextAttack is a powerful library for NLP."
augmented_text = char_swap_augmenter.augment(original_text)
 
print(f"Original Text: {original_text}")
print(f"Augmented Text: {augmented_text}")


Output:

Original Text: TextAttack is a powerful library for NLP.
Augmented Text: ['TextAttack is a powerqul library for NLP.']

Checklist Augmenter

The CheckListAugmenter in TextAttack is an augmentation technique that uses pre-defined transformations to generate perturbed versions of the input text. It leverages the CheckList library’s transformations for various NLP tasks. The CheckList library provides a collection of linguistic transformations that can be applied to text data to test and improve model robustness.

Python3




from textattack.augmentation import CheckListAugmenter
 
# Sample text
text = "TextAttack is a powerful library for NLP."
 
# Initialize the CheckListAugmenter
checklist_augmenter = CheckListAugmenter()
 
# Apply CheckList transformations
augmented_text = checklist_augmenter.augment(text)
 
# Print the results
print(f"Original Text: {text}")
print(f"Augmented Text: {augmented_text}")


Output:

Original Text: TextAttack is a powerful library for NLP.
Augmented Text: ['TextAttack is a powerful library for NLP.']

EasyDataAugmenter

EasyDataAugmenter (EDA) enhances text with a mix of word insertions, synonym substitutions, swaps, and deletions.

Python3




from textattack.augmentation import EasyDataAugmenter
 
# Example text
text = "TextAttack is a powerful library for NLP."
 
# Initialize the EasyDataAugmenter
eda_augmenter = EasyDataAugmenter()
 
# Apply EasyDataAugmenter for text augmentation
augmented_text = eda_augmenter.augment(text)
 
# Print the results
print(f"Original Text: {text}")
print(f"Augmented Text: {augmented_text}")


Output:

Original Text: TextAttack is a powerful library for NLP.
Augmented Text: ['NLP is a powerful library for TextAttack.',
'TextAttack is a potent library for NLP.',
'TextAttack is a powerful for NLP.',
'ampere TextAttack is a powerful library for NLP.']

CLAREAugmenter

CLARE enhances text using a pre-trained masked language model, through operations such as replacement, insertion, and merging.
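
A usage sketch in the same pattern as the augmenters above (note that CLARE downloads a pre-trained masked language model on first use, so it is noticeably slower than the other augmenters):

Python3

from textattack.augmentation import CLAREAugmenter

# CLARE loads a masked language model, so initialization takes a while
clare_augmenter = CLAREAugmenter()

text = "TextAttack is a powerful library for NLP."
augmented_text = clare_augmenter.augment(text)

print(f"Original Text: {text}")
print(f"Augmented Text: {augmented_text}")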

Back Translation

In back translation, documents in the source language are translated to a target language with machine translation models, and optionally translated back to the source language to produce paraphrase-like variants. It is also used when the required data exists only in a different language. A major drawback of this method is that the meaning of language-specific words is sometimes lost during translation. Therefore, the corpus to be translated should contain simple words that can be easily and accurately translated.

original text: This is a geek for geeks example.
translated text: यह गीक्स उदाहरण के लिए एक गीक है।
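
A hedged sketch of the round trip with MarianMT checkpoints (the Helsinki-NLP model names are assumptions; substitute any available translation pair):

Python3

from transformers import pipeline

to_hindi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
to_english = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")

text = "This is a geek for geeks example."
hindi = to_hindi(text)[0]["translation_text"]
back = to_english(hindi)[0]["translation_text"]

print(f"Original: {text}")
print(f"Pivot (Hindi): {hindi}")
print(f"Back-translated: {back}")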

Back Transliteration 

Back transliteration is a technique used to generate sentences/phrases that sound phonetically similar to the source language. This is useful for generating training data for classification tasks involving localized or bilingual phrases where the target language is a low-resource language, i.e. one with fewer data sources.

original text: This is a geek for geeks example.
transliterated text: थिस इस अ गैक फ़ोर गैक्स ए‍अम्प्ले
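
A hedged sketch with the indic-transliteration package (an assumption; any phonetic transliteration tool serves the same purpose), rendering romanized text phonetically in Devanagari:

Python3

from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

text = "this is a geek for geeks example"
# ITRANS-romanized input rendered phonetically in Devanagari script
print(transliterate(text, sanscript.ITRANS, sanscript.DEVANAGARI))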

Advantages of Data Augmentation

Data augmentation provides a number of benefits in natural language processing (NLP), enhancing model resilience and performance:

  • Increased Data Diversity: Data augmentation introduces variations in the input data by creating diverse instances of the original data. This helps expose the model to a broader range of linguistic patterns and variations.
  • Improved Generalization: Augmented data aids in enhancing the generalization ability of NLP models. By presenting the model with a more extensive and varied dataset during training, it learns to handle a wider array of scenarios, leading to better performance on unseen data.
  • Addressing Data Scarcity: In many NLP tasks, obtaining a large labeled dataset can be challenging. Data augmentation mitigates the issue of data scarcity by artificially expanding the dataset, allowing models to be trained on a more substantial amount of data.

Disadvantages of Data Augmentation

Natural language processing (NLP) data augmentation has a number of benefits, but there are also some drawbacks and things to keep in mind:

  • Risk of Introducing Unintended Biases: Data augmentation methods may inadvertently introduce biases into the augmented data, potentially leading to biased model predictions. Careful consideration is needed to ensure that augmented samples do not reinforce existing biases or introduce new ones.
  • Potential Overfitting to Augmented Patterns: If not carefully controlled, models might overfit to the specific patterns introduced by data augmentation, rather than learning more generalizable features. This can occur if augmentation is applied excessively or without proper validation.
  • Increased Computational Complexity: Augmenting data increases the computational requirements during training, as the model needs to process a larger amount of augmented data. This can lead to longer training times and increased resource consumption.

Frequently Asked Questions

1. What is Data Augmentation in NLP?

Data augmentation in NLP involves creating variations of the original text data to increase the diversity of the training dataset. This is done to enhance model generalization and performance.

2. Why is Data Augmentation Used in NLP?

Data augmentation is used to address challenges such as data scarcity, improve model robustness, and enhance generalization by exposing models to a broader range of linguistic variations.

3. What are Common Data Augmentation Techniques in NLP?

Common data augmentation techniques in NLP include synonym replacement, random insertion, random deletion, character-level modifications, and leveraging pre-trained language models for contextual augmentations.

4. How Does Data Augmentation Prevent Overfitting?

Data augmentation introduces variations in the training data, making it less likely for the model to memorize specific examples. This helps prevent overfitting and encourages the model to learn more generalized patterns.

5. Can Data Augmentation Introduce Biases into Models?

Yes, data augmentation methods may inadvertently introduce biases into the augmented data. It is crucial to carefully design and validate augmentation techniques to avoid reinforcing or introducing biases.


