
Stop Word Removal In R

In Natural Language Processing, different words carry different amounts of information. We want to keep only the words that are meaningful to machine learning models. When analyzing text, some words turn out to be highly repetitive while carrying little information; they can even degrade model performance by adding noise. These words are known as stop words.

Stop Words Package

Stop words are words such as a, an, and the, which search engines and machine learning algorithms ignore because they add little information to the text being processed. Building a custom stopword list is usually unnecessary, since the same words tend to be common across all documents. Instead, we can use the stopwords package available in the R Programming Language to remove them.
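To make the idea concrete, here is a minimal base R sketch of stopword filtering. The stopword list and the sentence are hand-written for illustration and do not come from any package.

```r
# Toy stopword list, hand-written for illustration (not from any package)
toy_stopwords <- c("a", "an", "the", "is", "of")

sentence <- "The cat is an expert of the hunt"

# Lowercase the sentence and split on whitespace to get tokens
tokens <- strsplit(tolower(sentence), "\\s+")[[1]]

# Keep only the tokens that are not in the stopword list
content_words <- tokens[!tokens %in% toy_stopwords]
print(content_words)
```

Only the content-bearing tokens remain after the filter; a packaged list does the same job with a much larger vocabulary.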






# import the stopwords library
library(stopwords)

Step 1: The stopwords package bundles lists from several sources, and the lists differ in size. We compare three of them: smart, snowball, and stopwords-iso.




# find the length of different sources of stopwords
sprintf("Length of smart stopwords is %d", length(stopwords(source = "smart")))
sprintf("Length of snowball stopwords is %d", length(stopwords(source = "snowball")))
sprintf("Length of stopwords-iso stopwords is %d", length(stopwords(source = "stopwords-iso")))

Output:



'Length of smart stopwords is 571'
'Length of snowball stopwords is 175'
'Length of stopwords-iso stopwords is 1298'
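As a sketch of how to discover the available sources rather than typing their names from memory, the stopwords package exposes stopwords_getsources(). The snippet assumes the package is installed; the variable names are illustrative.

```r
library(stopwords)

# List every source bundled with the installed version of the package
sources <- stopwords_getsources()
print(sources)

# Count the English stopwords in each of the three sources compared here
counts <- sapply(c("smart", "snowball", "stopwords-iso"),
                 function(src) length(stopwords("en", source = src)))
print(counts)
```

The exact counts depend on the installed package version, so they may differ slightly from the output shown above.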

Step 2: The three lists share many words. To visualize the overlap, we first store each list in a variable.




# Get stopwords lists
stopwords_smart = stopwords(source = "smart")
stopwords_snowball = stopwords(source = "snowball")
stopwords_iso = stopwords(source = "stopwords-iso")

Step 3: We now find the pairwise intersections using the intersect() function. It takes two vectors and returns the elements common to both.




# Find the intersections
is_sm_sn = intersect(stopwords_smart, stopwords_snowball)
is_sm_iso = intersect(stopwords_smart, stopwords_iso)
is_sn_iso = intersect(stopwords_snowball, stopwords_iso)

Step 4: Create a matrix and fill it with the length of the respective intersections.




# Create a matrix with intersection values
mat = matrix(0, nrow = 3, ncol = 3, dimnames = list(c("smart", "snowball",
                                                      "stopwords-iso"),
                                               c("smart", "snowball", "stopwords-iso")))
mat["smart", "snowball"] = length(is_sm_sn)
mat["smart", "stopwords-iso"] = length(is_sm_iso)
mat["snowball", "stopwords-iso"] = length(is_sn_iso)
 
# mirroring the values from x,y to y,x in the matrix
mat["stopwords-iso", "snowball"] = mat["snowball", "stopwords-iso"]
mat["stopwords-iso", "smart"] = mat["smart", "stopwords-iso"]
mat["snowball", "smart"] = mat["smart", "snowball"]
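The fill-and-mirror steps can also be done in one pass with nested sapply() calls. The sketch below uses small hand-written vectors in place of the real stopword lists, so the numbers are illustrative only. Unlike the matrix above, the diagonal here holds each list's own size instead of zero.

```r
# Toy word lists standing in for the real stopword sources
toy_lists <- list(
  first  = c("the", "a", "of"),
  second = c("the", "of", "and", "to"),
  third  = c("a", "and")
)

# For every pair of lists, count the words they have in common
overlap <- sapply(toy_lists, function(x) {
  sapply(toy_lists, function(y) length(intersect(x, y)))
})
print(overlap)
```

Because intersect() is symmetric in its result size, the matrix comes out symmetric without any manual mirroring.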

Step 5: Plot the matrix as a heatmap, where cells with lower values are light green and cells with higher values are blue.




library(ggplot2)
 
ggplot(data = as.data.frame(as.table(mat)),
       aes(Var1, Var2, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = ifelse(Freq > 0, as.character(Freq), "")), vjust = 1) + 
  scale_fill_gradient(low = "lightgreen", high = "blue") +
  labs(title = "Intersection of Stopwords",
       x = "",
       y = "",
       fill = "Intersection Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Output:

[Figure: heatmap of the pairwise intersection counts between the smart, snowball, and stopwords-iso lists]

Here we can see that smart and snowball share 165 words, whereas snowball and stopwords-iso share 175. Since snowball contains only 175 words, every snowball stopword also appears in stopwords-iso. Likewise, 570 of the 571 smart words are present in stopwords-iso. Hence stopwords-iso is the most exhaustive of the three lists and can be useful for our exploration.
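Whether one list is fully contained in another can be checked directly instead of being read off a plot. This is a small sketch with hand-written vectors; the same two calls work on the real stopword vectors.

```r
# Toy vectors standing in for a smaller and a larger stopword list
small_list <- c("the", "of", "and")
large_list <- c("the", "of", "and", "to", "a")

# TRUE when every element of the smaller list appears in the larger one
is_subset <- all(small_list %in% large_list)
print(is_subset)

# Words that appear only in the larger list
only_in_large <- setdiff(large_list, small_list)
print(only_in_large)
```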

Exploring the stopwords in a Novel

Now we are going to explore the stopwords in a novel. We are going to use the janeaustenr library since it provides access to the full texts of Jane Austen’s 6 published novels. The UTF-8 plain text for each novel was sourced from Project Gutenberg. The code below stores the text in a variable named poem. The following are the novels:

  1. sensesensibility: Sense and Sensibility, 1811
  2. prideprejudice: Pride and Prejudice, 1813
  3. mansfieldpark: Mansfield Park, 1814
  4. emma: Emma, 1815
  5. northangerabbey: Northanger Abbey, 1818
  6. persuasion: Persuasion, 1818

We are going to use the prideprejudice novel.

Step 1: Import the library and store the novel




library(janeaustenr)
poem = prideprejudice
head(poem)

Output:

'PRIDE AND PREJUDICE' '' 'By Jane Austen' '' '' ''

Step 2: Now we load tm, the text mining library. tm defines its own stopwords() function, so attaching it with library(tm) would mask the stopwords() function from the stopwords package that we already loaded. To avoid this, we load the tm namespace without attaching it and keep it in a variable named tm.




tm = loadNamespace("tm")

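Any function from the text mining library is then called as tm$function_name(), for example tm$Corpus(). tm is a framework for text mining applications within R. The functions that we are going to use are Corpus(), VectorSource(), tm_map(), content_transformer(), removePunctuation(), removeNumbers(), removeWords(), and DocumentTermMatrix().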
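The same pattern can be tried on a package that ships with R. The sketch below uses the stats namespace purely as a stand-in for tm; the variable name st is an arbitrary choice.

```r
# loadNamespace() returns the package namespace as an environment
st <- loadNamespace("stats")
print(is.environment(st))

# Functions are reached with `$`, just like tm$Corpus() in this article
via_dollar <- st$median(c(1, 5, 9))

# The `::` operator is the more common way to call a function without attaching
via_colons <- stats::median(c(1, 5, 9))

print(c(via_dollar, via_colons))
```

Either form keeps the package's names out of the search path, which is what prevents the clash between the two stopwords() functions.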

Step 3: We will preprocess the text in the following steps: build a corpus from the lines of the novel, convert the text to lowercase, remove punctuation, and remove numbers.




# Create a poem_corpus
poem_corpus = tm$Corpus(tm$VectorSource(poem))
 
# Preprocess the poem_corpus
poem_corpus = tm$tm_map(poem_corpus, tm$content_transformer(tolower))
poem_corpus = tm$tm_map(poem_corpus, tm$removePunctuation)
poem_corpus = tm$tm_map(poem_corpus, tm$removeNumbers)
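To see what these three transformations do to a single line, here is a rough base R approximation. It assumes the tm functions behave like simple character-class substitutions, which is a simplification; the sample line is made up.

```r
# A made-up line of text to clean
line <- "It is a truth, universally acknowledged (1813)!"

# Lowercase, as content_transformer(tolower) does
step1 <- tolower(line)

# Strip punctuation characters, approximating removePunctuation
step2 <- gsub("[[:punct:]]+", "", step1)

# Strip digits, approximating removeNumbers
step3 <- gsub("[[:digit:]]+", "", step2)

# Trim the leftover whitespace at the ends
cleaned <- trimws(step3)
print(cleaned)
```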

Step 4: Then we will create a document term matrix of the corpus. It represents the frequency of words that occur in a collection of documents.




# Create a document-term matrix
dtm = tm$DocumentTermMatrix(poem_corpus)
 
dtm

Output:

<<DocumentTermMatrix (documents: 13030, terms: 6698)>>
Non-/sparse entries: 91442/87183498
Sparsity : 100%
Maximal term length: 26
Weighting : term frequency (tf)
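The figures in this summary fit together arithmetically. The sketch below recomputes them from the three numbers printed above; the variable names are illustrative.

```r
# Values copied from the printed summary
n_docs   <- 13030   # one document per line of the novel
n_terms  <- 6698    # distinct terms
non_zero <- 91442   # cells holding a count greater than zero

# Total number of cells and how many of them are zero
total_cells <- n_docs * n_terms
zero_cells  <- total_cells - non_zero

# Share of empty cells, which the summary rounds to a whole percentage
sparsity <- zero_cells / total_cells
print(zero_cells)
print(round(100 * sparsity))
```

The zero-cell count matches the second number in the Non-/sparse entries line, and the sparsity of about 99.9% is what the summary reports as 100%.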

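Step 5: Then we calculate and store the frequency of each word with the colSums() function, which sums every column of a numeric matrix or data frame. Each column of the document-term matrix is one term, so its column sum is that term's total count across all documents.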




# Convert DTM to a data frame
doc_term_mat = as.data.frame(as.matrix(dtm))
 
# Calculate word frequencies
word_freq = colSums(doc_term_mat)

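Here we first convert the document-term matrix into a dense matrix with as.matrix() and then into a data frame. colSums() accepts either a numeric matrix or a data frame, so the as.data.frame() step is optional; what matters is that the sparse matrix is converted into a regular one first.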
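A tiny hand-built matrix shows what the column sums represent. The documents and counts below are invented for illustration.

```r
# Toy document-term matrix: rows are documents, columns are terms
toy_dtm <- matrix(
  c(1, 0, 2,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("pride", "and", "prejudice"))
)

# colSums() works directly on a numeric matrix and keeps the term names
toy_freq <- colSums(toy_dtm)
print(toy_freq)
```

The result is a named numeric vector, which is why names(word_freq) can be used later to recover the words.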

Step 6: Now we create a logical vector that records, for each word in word_freq, whether it appears in the smart stopwords list. We then use the dplyr library to build a data frame with the word, its frequency, and the stopword flag, sort it by descending frequency, and keep only the top 20 words.




library(dplyr)
library(stopwords)
 
stopwords_list = names(word_freq) %in% stopwords(source = "smart")
word_freq_df = data.frame(word = names(word_freq), freq = word_freq,
                          is_stopword = stopwords_list) %>%
  arrange(desc(freq)) %>%
  head(20)

stopwords_list: a logical vector that marks whether each word is a stopword (TRUE) or not (FALSE).
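The flagging and sorting can be followed on a handful of words. This sketch uses base R only, with order() in place of the dplyr pipeline; the frequencies and the short stopword list are made up.

```r
# Toy named frequency vector and a toy stopword list
toy_word_freq <- c(the = 40, elizabeth = 12, of = 30, darcy = 9)
toy_stopwords <- c("the", "of", "and")

# %in% returns one TRUE/FALSE per word
is_stop <- names(toy_word_freq) %in% toy_stopwords

toy_df <- data.frame(
  word        = names(toy_word_freq),
  freq        = unname(toy_word_freq),
  is_stopword = is_stop
)

# Sort by descending frequency and keep the top two rows
top_two <- head(toy_df[order(-toy_df$freq), ], 2)
print(top_two)
```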

Step 7: Now we use ggplot2 to plot the frequencies, with stopwords in red and non-stopwords in sky blue.




ggplot(word_freq_df, aes(x = reorder(word, -freq), y = freq, fill = factor(is_stopword))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "skyblue")) +
  labs(title = "Top 20 Words in Pride and Prejudice") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_cartesian(clip = "off")

Output:

[Figure: bar chart of the top 20 words in Pride and Prejudice, with stopwords in red and non-stopwords in sky blue]

Among the top 20 words, only 1 is not a stopword. Because stopwords occur so frequently, they dominate the word counts while carrying very little information.

Removing stopwords from text in R

Now we will learn to remove the stopwords from the text corpus.

We will use the same novel for the removal of stopwords.

Step 1: Store the stopwords in a variable.




# import the stopwords library
library(stopwords)
 
sm_stopwords = stopwords(source = "smart")
head(sm_stopwords)

Output:

'a' 'a\'s' 'able' 'about' 'above' 'according'

Step 2: Perform the text cleaning using the text mining library. First load tm and janeaustenr (for the novel), then perform the same basic cleaning as in the previous example.




# text mining library
tm = loadNamespace("tm")
library(janeaustenr) # package for novels
 
 
poem = prideprejudice
 
# Create a poem_corpus
poem_corpus = tm$Corpus(tm$VectorSource(poem))
 
# Preprocess the poem_corpus
poem_corpus = tm$tm_map(poem_corpus, tm$content_transformer(tolower))
poem_corpus = tm$tm_map(poem_corpus, tm$removePunctuation)
poem_corpus = tm$tm_map(poem_corpus, tm$removeNumbers)

Step 3: Now we will remove the stopwords by providing the smart stopwords list.




# removing stopwords
poem_corpus = tm$tm_map(poem_corpus, tm$removeWords, sm_stopwords)

removeWords: It is used to remove words from a text document.
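Conceptually, removing words amounts to a whole-word regular expression substitution. The function below is a hypothetical stand-in written for this sketch, not tm's actual implementation, and the word list is a toy one.

```r
# Hypothetical helper: delete whole-word matches of `words` from `text`
remove_words_sketch <- function(text, words) {
  # \\b anchors each alternative at word boundaries so that "a" does not
  # match inside "acknowledged"
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", text, perl = TRUE)
}

line <- "it is a truth universally acknowledged"
stripped <- remove_words_sketch(line, c("it", "is", "a"))

# Removal leaves gaps behind, so collapse repeated spaces and trim the ends
cleaned <- trimws(gsub("\\s+", " ", stripped))
print(cleaned)
```

Because punctuation was removed in the earlier cleaning step, list entries that contain an apostrophe (such as a's in the smart list) may no longer match the cleaned text.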

Step 4: Then we will create a document term matrix of the corpus and then sum the columns for the words list with frequency.




# Create a document-term matrix
dtm = tm$DocumentTermMatrix(poem_corpus)
 
# Convert DTM to a data frame
doc_term_mat = as.data.frame(as.matrix(dtm))
 
# Calculate word frequencies
word_freq = colSums(doc_term_mat)

Step 5: Finally, we create the word frequency data frame as in the previous example and plot it. This time the chart contains only words that are not stopwords.




library(dplyr)
 
stopwords_list = names(word_freq) %in% sm_stopwords
word_freq_df = data.frame(word = names(word_freq), freq = word_freq,
                          is_stopword = stopwords_list) %>%
  arrange(desc(freq)) %>%
  head(20)
 
ggplot(word_freq_df, aes(x = reorder(word, -freq), y = freq, fill = factor(is_stopword))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "skyblue")) +
  labs(title = "Top 20 Words in Pride and Prejudice") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_cartesian(clip = "off")

Output:

[Figure: bar chart of the top 20 words in Pride and Prejudice after stopword removal, all bars in sky blue]

Now we see only words that are not stopwords. We can also see how much lower their frequencies are compared to the previous example.

Stopwords in Different Languages in R

The stopwords library also provides the stopwords in different languages.

To get the language specific stopword, the syntax is as follows:

stopwords("language", source="stopwords-iso")

The source = "smart" list is only available for the English language.
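Which languages a given source supports can be checked with stopwords_getlanguages(), which returns language codes. The sketch assumes the stopwords package is installed; the variable names are illustrative.

```r
library(stopwords)

# Languages available in the smart source
smart_langs <- stopwords_getlanguages("smart")
print(smart_langs)

# Languages available in the stopwords-iso source
iso_langs <- stopwords_getlanguages("stopwords-iso")
print(length(iso_langs))

# A two-letter ISO 639-1 code can be passed as the language
fr_by_code <- stopwords("fr", source = "stopwords-iso")
print(head(fr_by_code))
```

Passing the code is a slightly safer habit than passing the language name, since codes are what the package stores internally.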

To get the stopwords for the French language, we pass "french" as the language.




library(stopwords)
stop_french = stopwords("french", source="stopwords-iso")
head(stop_french, 20)

Output:

 [1] "a"           "abord"       "absolument"  "afin"        "ah"
 [6] "ai"          "aie"         "aient"       "aies"        "ailleurs"
[11] "ainsi"       "ait"         "allaient"    "allo"        "allons"
[16] "allô"        "alors"       "anterieur"   "anterieure"  "anterieures"

We have printed only the first 20 words of the list.

For the Dutch language, we pass "Dutch".




stop_dutch = stopwords("Dutch", source="stopwords-iso")
print(head(stop_dutch, 20))

Output:

 [1] "aan"       "aangaande" "aangezien" "achte"     "achter"    "achterna"
 [7] "af"        "afgelopen" "al"        "aldaar"    "aldus"     "alhoewel"
[13] "alias"     "alle"      "allebei"   "alleen"    "alles"     "als"
[19] "alsnog"    "altijd"

Conclusion

In summary, removing stop words reduces the volume of text that search engines and machine learning models must process, which can improve efficiency and often accuracy. The stopwords package supplies ready-made lists from several sources and in many languages, and the tm library in R provides the tools to strip those words from a corpus, so that common, non-informative words do not take up storage or consume unnecessary processing time.

