
N-Grams In R

Last Updated : 16 Nov, 2023

N-grams are contiguous sequences of n items (words, characters, or symbols) extracted from a given sample of text or speech. They are widely used in natural language processing (NLP) and computational linguistics for applications such as language modeling, text generation, and information retrieval. In the R programming language, n-grams are a fundamental building block for understanding the structure and patterns within textual data.

Types of N-grams:

The types of N-grams are determined by the value of ‘n’. Here are some common types:

  • Unigrams (1-grams): Single words, representing the most basic units of text.
    Probability of a unigram P(w_i):
    P(w_i) = \frac{\text{Count}(w_i)}{\text{Total number of tokens}}
  • Bigrams (2-grams): Pairs of consecutive words. Example: “natural language,” “processing model.”
    Probability of a bigram P(w_i | w_{i-1}):
    P(w_i | w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}
  • Trigrams (3-grams): Triplets of consecutive words. Example: “contiguous sequences of,” “pairs of consecutive.”
    Probability of a trigram P(w_i | w_{i-2}, w_{i-1}):
    P(w_i | w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}

Consider the sentence: “I love natural language processing.”

  • Unigrams: “I,” “love,” “natural,” “language,” “processing.”
  • Bigrams: “I love,” “love natural,” “natural language,” “language processing.”
  • Trigrams: “I love natural,” “love natural language,” “natural language processing.”
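
To make this concrete, here is a minimal base-R sketch (no packages; the helper make_ngrams() is purely illustrative) that tokenizes the sentence and rebuilds the same three lists:

R

# Example sentence from above
sentence <- "I love natural language processing."

# Tokenize: lowercase, strip punctuation, split on whitespace
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]

# Paste every run of n consecutive tokens into a single n-gram string
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

make_ngrams(tokens, 1)  # unigrams: "i" "love" "natural" "language" "processing"
make_ngrams(tokens, 2)  # bigrams:  "i love" "love natural" ...
make_ngrams(tokens, 3)  # trigrams: "i love natural" ...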

Tokenization:

  • Before generating n-grams, the text is typically tokenized, breaking it into individual words or characters.
  • Tokenization is a crucial step that defines the units from which n-grams are extracted.
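
For example, base R's strsplit() can produce either word-level or character-level tokens (the input string below is just for illustration):

R

text <- "N-grams in R"

# Word-level tokens: split on whitespace
strsplit(tolower(text), "\\s+")[[1]]   # "n-grams" "in" "r"

# Character-level tokens: split into individual characters
strsplit(tolower(text), "")[[1]]       # "n" "-" "g" ... " " "i" "n" " " "r"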

Applications of N-grams

  • Language Modeling: N-grams are used to model the likelihood of a sequence of words in a language. This is foundational to machine translation, speech recognition, and other language generation tasks.
  • Text Prediction: Given a sequence of n-1 words, n-grams can be used to predict the next word in a sentence (see the sketch after this list).
  • Information Retrieval: In search engines, n-grams are used to index and retrieve documents based on user queries.
  • Text Classification: N-grams are used as features in text classification tasks, helping to capture local word patterns.
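
The following sketch illustrates the text-prediction idea with bigrams; the tiny corpus and the predict_next() helper are hypothetical, not part of any package:

R

# Hypothetical mini-corpus
corpus <- c("i love natural language processing",
            "i love programming in r",
            "natural language processing is fun")

# Build bigrams within each sentence (no bigrams across sentence boundaries)
bigrams <- unlist(lapply(strsplit(corpus, "\\s+"), function(tok) {
  paste(head(tok, -1), tail(tok, -1))
}))

# Predict the next word: take the most frequent bigram starting with `word`
predict_next <- function(word, bigrams) {
  candidates <- bigrams[startsWith(bigrams, paste0(word, " "))]
  if (length(candidates) == 0) return(NA_character_)
  best <- names(sort(table(candidates), decreasing = TRUE))[1]
  strsplit(best, " ")[[1]][2]
}

predict_next("natural", bigrams)   # "language"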

N-gram Probability

  • The probability of an n-gram occurring is often estimated from the frequency of its occurrence in a given corpus.
  • For example, the probability of a bigram “word1 word2” is estimated as the count of occurrences of the sequence divided by the total number of bigrams in the corpus.
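
As a small worked example (the token vector is made up for illustration), both the joint bigram proportion described above and the conditional probability P(w_i | w_{i-1}) from the earlier formula can be estimated directly from counts:

R

# Illustrative token sequence
tokens  <- c("this", "is", "the", "first", "document",
             "this", "is", "the", "second", "one")
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Joint proportion: count of the bigram over the total number of bigrams
sum(bigrams == "this is") / length(bigrams)   # 2 / 9

# Conditional probability P("is" | "this"): count("this is") / count("this")
sum(bigrams == "this is") / sum(tokens == "this")   # 2 / 2 = 1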

N-gram Smoothing

  • In practice, some n-grams may not be present in the training corpus. Smoothing techniques, such as add-one (Laplace) smoothing, are applied to handle unseen n-grams.
  • Smoothing ensures that all n-grams have non-zero probabilities.
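
A minimal sketch of add-one smoothing for bigram probabilities, assuming the usual formula P(w_i | w_{i-1}) = (Count(w_{i-1}, w_i) + 1) / (Count(w_{i-1}) + V), where V is the vocabulary size (the helper name is illustrative):

R

# Add-one (Laplace) smoothed bigram probability
laplace_bigram_prob <- function(w1, w2, tokens, bigrams) {
  V <- length(unique(tokens))                       # vocabulary size
  (sum(bigrams == paste(w1, w2)) + 1) / (sum(tokens == w1) + V)
}

tokens  <- c("this", "is", "the", "first", "document")
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

laplace_bigram_prob("this", "is", tokens, bigrams)        # seen:   (1 + 1) / (1 + 5)
laplace_bigram_prob("this", "document", tokens, bigrams)  # unseen: (0 + 1) / (1 + 5), still non-zero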

Limitations

  • N-grams have limitations, especially with longer sequences, as they may not capture long-range dependencies in language effectively.
  • Techniques like neural language models, such as LSTMs and Transformers, are employed to address these limitations.

Implementation of N-Grams using R

Create the dataset and prepare it for tokenization

R

# Install and load the ngram package
install.packages("ngram")
 
library(ngram)
 
# Example text corpus
corpus <- c(
  "This is the first document.",
  "The second document is here.",
  "And this is the third one.",
"This, is an example sentence!"
)
# Convert to lowercase
corpus_lower <- tolower(corpus)
cat(corpus_lower)


Output:

this is the first document. the second document is here. and this is the third one. this, is an example sentence!

Generating N-Grams

Here we convert the text into unigrams, bigrams, and trigrams using the ngram() function.

R

# Tokenize into unigrams for the entire text
unigrams_text <- ngram(corpus_lower, n = 1, sep = ".!,/ ")
print(unigrams_text)
# Extract Unigram values
unigram_values <- get.ngrams(unigrams_text)
print(unigram_values)
 
# Tokenize into bigrams for the entire text
bigrams_text <- ngram(corpus_lower, n = 2, sep = ".!,/ ")
print(bigrams_text)
bigram_values <- get.ngrams(bigrams_text)
print(bigram_values)
 
# Tokenize into trigrams for the entire text
trigrams_text <- ngram(corpus_lower, n = 3, sep = ".!,/ ")
print(trigrams_text)
trigram_values <- get.ngrams(trigrams_text)
print(trigram_values)


Output:

 An ngram object with 13 1-grams 
[1] "the" "an" "example" "and" "first" "document"
[7] "is" "second" "one" "sentence" "this" "third"
[13] "here"
An ngram object with 14 2-grams
[1] "document is" "and this" "is the" "third one"
[5] "this is" "first document" "the third" "example sentence"
[9] "an example" "second document" "is an" "the second"
[13] "the first" "is here"
An ngram object with 12 3-grams
[1] "the first document" "this is the" "the second document"
[4] "second document is" "and this is" "document is here"
[7] "this is an" "is the third" "an example sentence"
[10] "the third one" "is the first" "is an example"

In this example, the sep = ".!,/ " argument specifies which characters (period, exclamation mark, comma, slash, and space) are treated as separators between words. We can adjust the sep parameter to include any other characters that we want to use as separators.
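
For instance, if we used only a space as the separator, punctuation would stay attached to the tokens (a quick illustrative check):

R

# With sep = " " only, punctuation is not stripped from the tokens
space_only <- ngram("this, is an example sentence!", n = 1, sep = " ")
get.ngrams(space_only)   # includes "this," and "sentence!" with punctuation attached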

Probabilities of Each N-Gram

R

# Calculate probabilities for unigrams
unigram_probabilities <- get.phrasetable(unigrams_text)
cat("Unigram Probabilities:\n")
print(unigram_probabilities)
 
# Calculate probabilities for bigrams
bigram_probabilities <- get.phrasetable(bigrams_text)
cat("\nBigram Probabilities:\n")
print(bigram_probabilities)
 
# Calculate probabilities for trigrams
trigram_probabilities <- get.phrasetable(trigrams_text)
cat("\nTrigram Probabilities:\n")
print(trigram_probabilities)


Output:

Unigram Probabilities:
ngrams freq prop
1 is 4 0.19047619
2 the 3 0.14285714
3 this 3 0.14285714
4 document 2 0.09523810
5 an 1 0.04761905
6 example 1 0.04761905
7 and 1 0.04761905
8 first 1 0.04761905
9 second 1 0.04761905
10 one 1 0.04761905
11 sentence 1 0.04761905
12 third 1 0.04761905
13 here 1 0.04761905
Bigram Probabilities:
ngrams freq prop
1 this is 3 0.17647059
2 is the 2 0.11764706
3 document is 1 0.05882353
4 and this 1 0.05882353
5 third one 1 0.05882353
6 first document 1 0.05882353
7 the third 1 0.05882353
8 example sentence 1 0.05882353
9 an example 1 0.05882353
10 second document 1 0.05882353
11 is an 1 0.05882353
12 the second 1 0.05882353
13 the first 1 0.05882353
14 is here 1 0.05882353
Trigram Probabilities:
ngrams freq prop
1 this is the 2 0.15384615
2 the first document 1 0.07692308
3 the second document 1 0.07692308
4 second document is 1 0.07692308
5 and this is 1 0.07692308
6 document is here 1 0.07692308
7 this is an 1 0.07692308
8 is the third 1 0.07692308
9 an example sentence 1 0.07692308
10 the third one 1 0.07692308
11 is the first 1 0.07692308
12 is an example 1 0.07692308
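
As a follow-up, the unigram and bigram tables above can be combined to estimate the conditional probability P(w_i | w_{i-1}) from the earlier formula. This short sketch assumes the unigrams_text and bigrams_text objects created above; trimws() is used only as a precaution against stray spaces in the ngrams column.

R

uni <- get.phrasetable(unigrams_text)
bi  <- get.phrasetable(bigrams_text)

# Look up raw counts from the tables
count_uni <- function(w)  sum(uni$freq[trimws(uni$ngrams) == w])
count_bi  <- function(ng) sum(bi$freq[trimws(bi$ngrams) == ng])

# P("is" | "this") = Count("this is") / Count("this")
count_bi("this is") / count_uni("this")   # 3 / 3 = 1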

Conclusion

N-grams are a versatile and widely used concept in NLP, providing a flexible way to represent and analyze the structure of textual data. They form the basis for more advanced language models and applications in the field of natural language processing.


