
N-Grams In R

N-grams are contiguous sequences of n items (words, characters, or symbols) extracted from a given sample of text or speech. They are widely used in natural language processing (NLP) and computational linguistics for applications such as language modelling, text generation, and information retrieval. In the R Programming Language, n-grams are a fundamental building block for understanding the structure and patterns within textual data.

Types of N-grams:

The types of N-grams are determined by the value of 'n'. Here are some common types:

Unigram (n = 1): a single item, such as an individual word.
Bigram (n = 2): a sequence of two consecutive items.
Trigram (n = 3): a sequence of three consecutive items.

Consider the sentence: "I love natural language processing."

Tokenization:

Unigrams: "I", "love", "natural", "language", "processing"
Bigrams: "I love", "love natural", "natural language", "language processing"
Trigrams: "I love natural", "love natural language", "natural language processing"
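These tokens can be reproduced in base R with strsplit() and paste(); the snippet below is only a minimal sketch (the object names sentence, tokens and bigrams are introduced here for illustration), while the sections further down use the ngram package instead.

# Tokenize the example sentence into words using base R
sentence <- tolower("I love natural language processing")
tokens <- strsplit(sentence, " ")[[1]]
print(tokens)
# [1] "i" "love" "natural" "language" "processing"

# Build bigrams by pasting consecutive tokens together
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
print(bigrams)
# [1] "i love" "love natural" "natural language" "language processing"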

Applications of N-grams

N-gram Probability

N-gram Smoothing

Limitations

Implementation of N-Grams using R

Create dataset and tokenization

# Install and load the ngram package
install.packages("ngram")
 
library(ngram)
 
# Example text corpus
corpus <- c(
  "This is the first document.",
  "The second document is here.",
  "And this is the third one.",
"This, is an example sentence!"
)
# Convert to lowercase
corpus_lower <- tolower(corpus)
cat(corpus_lower)

                    

Output:



this is the first document. the second document is here. and this is the third one. this, is an example sentence!
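Depending on the use case, one may also want to strip punctuation up front instead of relying on the sep argument later on. Below is a minimal sketch using base R's gsub() (the object name corpus_clean is introduced here for illustration and is not part of the article's pipeline):

# Optional preprocessing: remove punctuation so that only spaces separate words
corpus_clean <- gsub("[[:punct:]]", "", corpus_lower)
cat(corpus_clean)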

Bag of N-Grams

Here we convert the text into unigrams, bigrams, and trigrams.

# Tokenize into unigrams for the entire text
unigrams_text <- ngram(corpus_lower, n = 1, sep = ".!,/ ")
print(unigrams_text)
# Extract Unigram values
unigram_values <- get.ngrams(unigrams_text)
print(unigram_values)
 
# Tokenize into bigrams for the entire text
bigrams_text <- ngram(corpus_lower, n = 2, sep = ".!,/ ")
print(bigrams_text)
bigram_values <- get.ngrams(bigrams_text)
print(bigram_values)
 
# Tokenize into trigrams for the entire text
trigrams_text <- ngram(corpus_lower, n = 3, sep = ".!,/ ")
print(trigrams_text)
trigram_values <- get.ngrams(trigrams_text)
print(trigram_values)

                    

Output:

 An ngram object with 13 1-grams 
[1] "the" "an" "example" "and" "first" "document"
[7] "is" "second" "one" "sentence" "this" "third"
[13] "here"
An ngram object with 14 2-grams
[1] "document is" "and this" "is the" "third one"
[5] "this is" "first document" "the third" "example sentence"
[9] "an example" "second document" "is an" "the second"
[13] "the first" "is here"
An ngram object with 12 3-grams
[1] "the first document" "this is the" "the second document"
[4] "second document is" "and this is" "document is here"
[7] "this, is a" "is the third" "an example sentence"
[10] "the third one" "is the first" "is an example"

In this example, the sep = ".!,/ " argument specifies that the space character and the punctuation marks '.', '!', ',' and '/' all act as separators when splitting the text into words. We can adjust the sep parameter to include any other characters that we want to use as separators between words.
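For instance, a quick variant (reusing the same corpus_lower object as above) that splits on spaces only, so punctuation stays attached to the neighbouring words:

# Variant: split on spaces only, leaving punctuation attached to the tokens
bigrams_space_only <- ngram(corpus_lower, n = 2, sep = " ")
head(get.ngrams(bigrams_space_only))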

Probabilities of Each N-Gram

The get.phrasetable() function returns a frequency table for an ngram object, listing each n-gram together with its raw count (freq) and its relative frequency (prop).

# Calculate probabilities for unigrams
unigram_probabilities <- get.phrasetable(unigrams_text)
cat("Unigram Probabilities:\n")
print(unigram_probabilities)
 
# Calculate probabilities for bigrams
bigram_probabilities <- get.phrasetable(bigrams_text)
cat("\nBigram Probabilities:\n")
print(bigram_probabilities)
 
# Calculate probabilities for trigrams
trigram_probabilities <- get.phrasetable(trigrams_text)
cat("\nTrigram Probabilities:\n")
print(trigram_probabilities)

                    

Output:

Unigram Probabilities:
ngrams freq prop
1 is 4 0.19047619
2 the 3 0.14285714
3 this 3 0.14285714
4 document 2 0.09523810
5 an 1 0.04761905
6 example 1 0.04761905
7 and 1 0.04761905
8 first 1 0.04761905
9 second 1 0.04761905
10 one 1 0.04761905
11 sentence 1 0.04761905
12 third 1 0.04761905
13 here 1 0.04761905
Bigram Probabilities:
ngrams freq prop
1 this is 3 0.17647059
2 is the 2 0.11764706
3 document is 1 0.05882353
4 and this 1 0.05882353
5 third one 1 0.05882353
6 first document 1 0.05882353
7 the third 1 0.05882353
8 example sentence 1 0.05882353
9 an example 1 0.05882353
10 second document 1 0.05882353
11 is an 1 0.05882353
12 the second 1 0.05882353
13 the first 1 0.05882353
14 is here 1 0.05882353
Trigram Probabilities:
ngrams freq prop
1 this is the 2 0.15384615
2 the first document 1 0.07692308
3 the second document 1 0.07692308
4 second document is 1 0.07692308
5 and this is 1 0.07692308
6 document is here 1 0.07692308
7 this is an 1 0.07692308
8 is the third 1 0.07692308
9 an example sentence 1 0.07692308
10 the third one 1 0.07692308
11 is the first 1 0.07692308
12 is an example 1 0.07692308
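The phrasetables above can also be combined to estimate conditional probabilities such as P(w2 | w1) = count(w1 w2) / count(w1). The snippet below is a sketch of that calculation, reusing the unigram and bigram objects created earlier (the trimws() call is only a defensive guard in case the ngram strings carry trailing whitespace):

# Estimate the conditional probability P("is" | "this") from the tables above
uni <- get.phrasetable(unigrams_text)
bi  <- get.phrasetable(bigrams_text)

count_this    <- uni$freq[trimws(uni$ngrams) == "this"]
count_this_is <- bi$freq[trimws(bi$ngrams) == "this is"]

p_is_given_this <- count_this_is / count_this
cat("P(is | this) =", p_is_given_this, "\n")   # 3 / 3 = 1 for this small corpus

The same fitted objects also support the text-generation use case mentioned in the introduction: the ngram package's babble() function samples pseudo-random text from an n-gram model (a quick sketch; the output varies with the seed):

# Generate 15 words of pseudo-random text from the trigram model
babble(trigrams_text, genlen = 15, seed = 123)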

Conclusion

N-grams are a versatile and widely used concept in NLP, providing a flexible way to represent and analyze the structure of textual data. They form the basis for more advanced language models and applications in the field of natural language processing.

