Stop Word Removal In R

In Natural Language Processing, different words carry different amounts of information, and we want to keep only the words that are meaningful to a machine learning model. Some words occur very frequently in almost every text yet carry little information. Rather than helping, they add noise and can degrade model performance. These words are known as stop words.

Stop Words Package

Stop words are words such as a, an, the, etc., which are ignored by search engines and machine learning algorithms since they add little information to the text we are operating on. Building one's own list of stopwords is rarely necessary, since the same words tend to be common across all documents. Instead, we can use the ready-made lists in the stopwords package available in the R programming language to remove them.
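To make the idea concrete, here is a minimal base R sketch of what removing stop words means. The five-word list tiny_stopwords and the sample sentence are made up for illustration; they are not taken from any package.

```r
# Hypothetical mini stop word list, for illustration only
tiny_stopwords <- c("a", "an", "the", "is", "on")

sentence <- "The cat is sitting on a mat"

# Lowercase the sentence and split it into words on whitespace
tokens <- strsplit(tolower(sentence), "\\s+")[[1]]

# Keep only the words that are not in the stop word list
kept <- tokens[!tokens %in% tiny_stopwords]

print(kept)  # the remaining words are "cat", "sitting" and "mat"
```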

# import the stopwords library

library(stopwords)

Step 1: The stopwords package bundles lists from several sources, and these lists differ in size. Three commonly used sources are:

smart: The stopword list is based on the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System.
snowball: The stopword list is taken from the Snowball stemmer project.
stopwords-iso: A collection of stopword lists for many languages, identified by their ISO 639-1 language codes.

# find the length of different sources of stopwords

sprintf("Length of smart stopwords is %d", length(stopwords(source = "smart")))

sprintf("Length of snowball stopwords is %d", length(stopwords(source = "snowball")))

sprintf("Length of stopwords-iso stopwords is %d", length(stopwords(source = "stopwords-iso")))

Output:

'Length of smart stopwords is 571'
'Length of snowball stopwords is 175'
'Length of stopwords-iso stopwords is 1298'
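The package can also report which sources it ships. The sketch below assumes the stopwords package is installed and uses its stopwords_getsources() helper; the variable names are only illustrative.

```r
library(stopwords)

# List every stopword source bundled with the installed package version
all_sources <- stopwords_getsources()
print(all_sources)

# Compare the sizes of the three English lists used in this article
three_sources <- c("smart", "snowball", "stopwords-iso")
sizes <- sapply(three_sources,
                function(src) length(stopwords("en", source = src)))
print(sizes)
```

The exact counts may differ from the output shown above if the installed package version has updated its lists.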

Step 2: The lists of stopwords share many words. To visualize the overlap, we first store each list in a variable.

# Get stopwords lists

stopwords_smart = stopwords(source = "smart")

stopwords_snowball = stopwords(source = "snowball")

stopwords_iso = stopwords(source = "stopwords-iso")

Step 3: We now find the common words using the intersect() function. It takes two vectors and returns the elements that are present in both.

# Find the intersections

is_sm_sn = intersect(stopwords_smart, stopwords_snowball)

is_sm_iso = intersect(stopwords_smart, stopwords_iso)

is_sn_iso = intersect(stopwords_snowball, stopwords_iso)
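The intersection tells us what two lists share; setdiff() shows the other side, the words that one list has and the other lacks. This sketch assumes the stopwords package is installed and uses short, illustrative variable names rather than the ones above.

```r
library(stopwords)

sm <- stopwords(source = "smart")
sn <- stopwords(source = "snowball")

# Words that appear in the snowball list but not in the smart list
only_snowball <- setdiff(sn, sm)

length(only_snowball)
head(only_snowball)
```

Every snowball word is either shared with smart or unique to snowball, so the size of the intersection plus the size of this difference equals the number of distinct snowball words.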

Step 4: Create a matrix and fill it with the length of the respective intersections.

# Create a matrix with intersection values

mat = matrix(0, nrow = 3, ncol = 3, dimnames = list(c("smart", "snowball", 

                                                      "stopwords-iso"), 

                                               c("smart", "snowball", "stopwords-iso")))

mat["smart", "snowball"] = length(is_sm_sn)

mat["smart", "stopwords-iso"] = length(is_sm_iso)

mat["snowball", "stopwords-iso"] = length(is_sn_iso)
 
# mirroring the values from x,y to y,x in the matrix

mat["stopwords-iso", "snowball"] = mat["snowball", "stopwords-iso"] 

mat["stopwords-iso", "smart"] = mat["smart", "stopwords-iso"]

mat["snowball", "smart"] = mat["smart", "snowball"]
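The six assignments above can also be produced in one pass. The following is only a sketch of an alternative, assuming the stopwords package is installed: it keeps the lists in a named list (here called lists, an illustrative name) and lets nested sapply() calls compute every pairwise intersection.

```r
library(stopwords)

# Named list of the three stopword vectors
lists <- list(
  "smart"         = stopwords(source = "smart"),
  "snowball"      = stopwords(source = "snowball"),
  "stopwords-iso" = stopwords(source = "stopwords-iso")
)

# For every pair (a, b), count the words the two lists have in common
overlap <- sapply(lists, function(a) {
  sapply(lists, function(b) length(intersect(a, b)))
})

# Set the diagonal to 0 so the result matches the matrix built by hand
diag(overlap) <- 0

overlap
```

Because intersect() is symmetric, the resulting matrix is symmetric without any explicit mirroring step.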

Step 5: Plot the matrix as a heatmap, where cells with lower values are light green and cells with higher values are blue.

library(ggplot2)
 
ggplot(data = as.data.frame(as.table(mat)),

       aes(Var1, Var2, fill = Freq)) +

  geom_tile(color = "white") +

  geom_text(aes(label = ifelse(Freq > 0, as.character(Freq), "")), vjust = 1) +  

  scale_fill_gradient(low = "lightgreen", high = "blue") +

  labs(title = "Intersection of Stopwords",

       x = "",

       y = "",

       fill = "Intersection Count") +

  theme_minimal() +

  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Output

Here we can see that the intersection between smart and snowball is 165, whereas the intersection of snowball with stopwords-iso is 175, which is the full size of the snowball list, so every snowball word is also in stopwords-iso. Almost all the words in smart are present in stopwords-iso as well, since that count is 570 out of 571. Hence stopwords-iso is the most exhaustive list and can be useful for our exploration.

Exploring the stopwords in a Novel

Now we are going to explore the stopwords in a novel. We are going to use the janeaustenr library, since it provides access to the full texts of Jane Austen’s 6 published novels. The UTF-8 plain text for each novel was sourced from Project Gutenberg. The following novels are available:

sensesensibility: Sense and Sensibility, 1811
prideprejudice: Pride and Prejudice, 1813
mansfieldpark: Mansfield Park, 1814
emma: Emma, 1815
northangerabbey: Northanger Abbey, 1818
persuasion: Persuasion, 1818

We are going to use the prideprejudice novel.

Step 1: Import the library and store the novel

library(janeaustenr)
poem = prideprejudice

head(poem)

Output:

'PRIDE AND PREJUDICE' '' 'By Jane Austen' '' '' ''

Step 2: Now we are going to load the tm library, which is the text mining library. We load it with loadNamespace() and keep the result under the name tm instead of attaching it with library(), since attaching it would mask the stopwords() function of the stopwords package that we already imported.

tm = loadNamespace("tm")

Any function of the text mining library is then called as tm$function_name(). tm, or the text mining library, is a framework for text mining applications within R. The functions that we are going to use are as follows:

tm_map: It is used to apply transformation functions (also denoted as mappings) to corpus or the text document converted to corpus.
Corpus: These are collections of documents containing (natural language) text.
VectorSource: It creates a vector source. A vector source interprets each element of the vector x as a document.
content_transformer: It is used to create content transformer functions which modify the content of an R object.
removePunctuation: It is used to remove the punctuation marks from a text document.
removeNumbers: It is used to remove the numbers from the text document.
DocumentTermMatrix: Constructs or coerces to a term-document matrix or a document-term matrix.

Step 3: We will preprocess the text in the following steps:

Convert the text to a vector source and then to a corpus.
Convert the text to lowercase. For this, we are going to use the base R function tolower, which translates the characters in character vectors from upper to lower case.
Remove the punctuation and the numbers using the text mining library.

# Create a poem_corpus

poem_corpus = tm$Corpus(tm$VectorSource(poem))
 
# Preprocess the poem_corpus

poem_corpus = tm$tm_map(poem_corpus, tm$content_transformer(tolower))

poem_corpus = tm$tm_map(poem_corpus, tm$removePunctuation)

poem_corpus = tm$tm_map(poem_corpus, tm$removeNumbers)
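To see what each transformation does, the same functions can be tried on a couple of plain strings, since removePunctuation and removeNumbers also accept character vectors. This is a small sketch that assumes the tm package is installed; the two sample strings and the variable names are made up, and the tm:: prefix is used here instead of the tm$ alias so that the snippet runs on its own.

```r
# Two made-up strings standing in for lines of the novel
raw <- c("Hello, World! 123", "Stop Words in R, 2023.")

# 1. Lowercase everything (base R)
lowered <- tolower(raw)

# 2. Strip punctuation marks
no_punct <- tm::removePunctuation(lowered)

# 3. Strip digits
cleaned <- tm::removeNumbers(no_punct)

print(cleaned)
```

tm_map() applies exactly these functions to every document of the corpus. Note that removing digits can leave stray spaces behind, which is harmless for the word counts computed next.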

Step 4: Then we will create a document term matrix of the corpus. It represents the frequency of words that occur in a collection of documents.

# Create a document-term matrix

dtm = tm$DocumentTermMatrix(poem_corpus)
 
dtm

Output:

<<DocumentTermMatrix (documents: 13030, terms: 6698)>>
Non-/sparse entries: 91442/87183498
Sparsity           : 100%
Maximal term length: 26
Weighting          : term frequency (tf)

Step 5: Then we will calculate and store the frequency of each word with the help of the colSums() function, which returns the sum of each column of a numeric matrix or data frame.

# Convert DTM to a data frame

doc_term_mat = as.data.frame(as.matrix(dtm))
 
# Calculate word frequencies

word_freq = colSums(doc_term_mat)

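Here we first convert our document-term matrix into a dense matrix and then a data frame, since colSums() works on ordinary matrices and data frames but does not accept the sparse document-term matrix object directly. Keep in mind that the dense form stores all of its roughly 87 million cells and therefore needs far more memory than the sparse one.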
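One possible way to avoid the dense conversion is to sum the columns of the sparse matrix directly. The sketch below assumes the slam package, which tm builds on, is installed and uses its col_sums() function. The three short documents are made up so that both approaches can be compared quickly.

```r
# Three made-up documents, small enough to inspect by hand
docs <- c("the cat sat", "the dog sat", "the end")

corpus <- tm::VCorpus(tm::VectorSource(docs))
dtm <- tm::DocumentTermMatrix(corpus)

# Column sums computed on the sparse matrix, with no dense copy
sparse_freq <- slam::col_sums(dtm)

# Column sums computed the way the article does it, via a dense matrix
dense_freq <- colSums(as.matrix(dtm))

# Both approaches should give the same counts
print(sparse_freq)
print(dense_freq)
```

On a corpus the size of the novel, the sparse route should be much lighter on memory, while the result is the same vector of word frequencies.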

Step 6: Now we will create a logical vector that records, for each word in word_freq, whether it is present in the stopwords list. We then use the dplyr library to build a dataframe with one column for the word, one for its frequency and one marking whether it is a stopword, sort it by frequency, and keep only the top 20 words.

library(dplyr)

library(stopwords)
 
stopwords_list = names(word_freq) %in% stopwords(source = "smart")

word_freq_df = data.frame(word = names(word_freq), freq = word_freq, 

                          is_stopword = stopwords_list) %>%

  arrange(desc(freq)) %>%

  head(20)

stopwords_list: A logical vector that marks whether each word is a stopword or not.
arrange: It sorts the dataframe, and together with desc() it sorts in descending order. It comes from the dplyr package, a tool for data manipulation that provides a consistent set of verbs for the most common data manipulation tasks.
head: It is used to take the first rows. Here we pass 20 to take the top 20 words.
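The flagging and sorting logic can be checked on a tiny example in base R. The word counts and the three-word stop list below are hypothetical values chosen for illustration, and order() is used in place of dplyr's arrange() so the snippet needs no packages.

```r
# Hypothetical word frequencies, named by word
word_freq <- c(the = 120, elizabeth = 45, and = 98, darcy = 40)

# Hypothetical stop word list
tiny_stopwords <- c("the", "and", "of")

# TRUE where the word is a stop word, FALSE otherwise
is_stopword <- names(word_freq) %in% tiny_stopwords

word_freq_df <- data.frame(word = names(word_freq),
                           freq = word_freq,
                           is_stopword = is_stopword)

# Sort by descending frequency and keep the top 2 rows
top <- head(word_freq_df[order(-word_freq_df$freq), ], 2)

print(top)
```

Here both of the top two rows are stop words ("the" and "and"), which mirrors what happens on the full novel.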

Step 7: Now we will use ggplot to plot the frequencies. We use red for stopwords and sky blue for non-stopwords.

ggplot(word_freq_df, aes(x = reorder(word, -freq), y = freq, fill = factor(is_stopword))) +

  geom_bar(stat = "identity") +

  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "skyblue")) +

  labs(title = "Top 20 Words in Pride and Prejudice") +

  theme_minimal() +

  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +

  coord_cartesian(clip = "off")

Output:

We see that among the top 20 words, only 1 word is not a stopword. Because of their large frequency, stopwords dominate the counts and crowd out the words that actually carry information.

Removing stopwords from text in R

Now we will learn to remove the stopwords from the text corpus.

We will use the same novel for the removal of stopwords.

Step 1: Store the stopwords in a variable.

# import the stopwords library

library(stopwords)
 
sm_stopwords = stopwords(source = "smart")

head(sm_stopwords)

Output:

'a' 'a\'s' 'able' 'about' 'above' 'according'

Step 2: Perform the text cleaning using the text mining library. First load the text mining and janeaustenr (for the novel) libraries, and then perform the same basic cleaning that we performed in the previous example.

# text mining library

tm = loadNamespace("tm")

library(janeaustenr) # package for novels
 
poem = prideprejudice
 
# Create a poem_corpus

poem_corpus = tm$Corpus(tm$VectorSource(poem))
 
# Preprocess the poem_corpus

poem_corpus = tm$tm_map(poem_corpus, tm$content_transformer(tolower))

poem_corpus = tm$tm_map(poem_corpus, tm$removePunctuation)

poem_corpus = tm$tm_map(poem_corpus, tm$removeNumbers)

Step 3: Now we will remove the stopwords by providing the smart stopwords list.

# removing stopwords

poem_corpus = tm$tm_map(poem_corpus, tm$removeWords, sm_stopwords)

removeWords: It is used to remove words from a text document.
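removeWords also accepts a plain character string, which makes its behaviour easy to inspect. The sketch below assumes the tm package is installed; the sentences and the two-word stop list are made up, and the whitespace tidy-up uses base R functions.

```r
text <- "the cat sat on the mat"

# Delete whole-word matches of the listed words
removed <- tm::removeWords(text, c("the", "on"))

# The deleted words leave gaps behind, so collapse repeated spaces and trim
tidy <- trimws(gsub("\\s+", " ", removed))
print(tidy)  # "cat sat mat"

# A contraction whose apostrophe was already stripped no longer matches
unchanged <- tm::removeWords("i dont know", "don't")
print(unchanged)  # "i dont know"
```

The second call illustrates a side effect of the cleaning order used in this article: once punctuation has been removed, a word such as dont is not matched by a stop list entry spelled with an apostrophe. Removing stopwords before punctuation is one way to handle this, if it matters for your data.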

Step 4: Then we will create a document-term matrix of the corpus and sum its columns to get the frequency of each word.

# Create a document-term matrix

dtm = tm$DocumentTermMatrix(poem_corpus)
 
# Convert DTM to a data frame

doc_term_mat = as.data.frame(as.matrix(dtm))
 
# Calculate word frequencies

word_freq = colSums(doc_term_mat)

Step 5: Finally, we create our word frequency dataframe as in the previous example and plot it. Now we see only the words that are not stopwords.

library(dplyr)
 
stopwords_list = names(word_freq) %in% sm_stopwords

word_freq_df = data.frame(word = names(word_freq), freq = word_freq,

                          is_stopword = stopwords_list) %>%

  arrange(desc(freq)) %>%

  head(20)
 
ggplot(word_freq_df, aes(x = reorder(word, -freq), y = freq, fill = factor(is_stopword))) +

  geom_bar(stat = "identity") +

  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "skyblue")) +

  labs(title = "Top 20 Words in Pride and Prejudice") +

  theme_minimal() +

  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +

  coord_cartesian(clip = "off")

Output:

Now we see only the words that are not stopwords. We can also see how much lower their frequencies are compared to the previous example.

Stopwords in Different Languages in R

The stopwords library also provides the stopwords in different languages.

To get the language-specific stopwords, the syntax is as follows:

stopwords("language", source="stopwords-iso")

The source = "smart" list is only available for the English language.

To get the stopwords for the French language, we pass french as the language.

library(stopwords)

stop_french = stopwords("french", source="stopwords-iso")

head(stop_french, 20)

Output:

 [1] "a"           "abord"       "absolument"  "afin"        "ah"
 [6] "ai"          "aie"         "aient"       "aies"        "ailleurs"   
[11] "ainsi"       "ait"         "allaient"    "allo"        "allons"     
[16] "allô"        "alors"       "anterieur"   "anterieure"  "anterieures"

We have printed only the first 20 words of the list.

For the Dutch language, we pass Dutch.

stop_dutch = stopwords("Dutch", source="stopwords-iso")

print(head(stop_dutch, 20))

Output:

 [1] "aan"       "aangaande" "aangezien" "achte"     "achter"    "achterna" 
 [7] "af"        "afgelopen" "al"        "aldaar"    "aldus"     "alhoewel" 
[13] "alias"     "alle"      "allebei"   "alleen"    "alles"     "als"      
[19] "alsnog"    "altijd"
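To find out which languages a source covers, the package provides stopwords_getlanguages(). The sketch below assumes the stopwords package is installed and requests the French and Dutch lists by their two-letter ISO 639-1 codes (fr and nl) instead of the language names used above; the variable names are illustrative.

```r
library(stopwords)

# Language codes available in the stopwords-iso source
iso_langs <- stopwords_getlanguages("stopwords-iso")
head(iso_langs)

# Request the lists by ISO 639-1 code
stop_fr <- stopwords("fr", source = "stopwords-iso")
stop_nl <- stopwords("nl", source = "stopwords-iso")

# Compare the sizes of the two lists
c(french = length(stop_fr), dutch = length(stop_nl))
```

Checking the available codes first helps avoid an error when a source does not cover the language you need.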

Conclusion

In summary, removing stop words strips common, non-informative words from text data before it is passed to search engines and machine learning models, which reduces the amount of data to store and process. In R, the stopwords package supplies ready-made lists for many languages, and the ‘tm’ library provides the tools to remove them from a corpus.
