
Word Tokenization Using R

To construct features for supervised machine learning from natural language, raw text must be encoded as integers so that computation can be performed on it. Tokenization is usually the first stage in this journey from natural language to features, and in almost any kind of text analysis. Understanding the concepts of tokenization and tokens, as well as the associated idea of an n-gram, is essential for practically every natural language processing activity.

Tokens are small units of text; they can be of various types, such as words, characters, sentences, or n-grams.



Word Tokenization in R

Word tokenization is a fundamental task in Natural Language Processing (NLP) and text analysis. It is one of the crucial steps in NLP and involves breaking down a text into smaller units, called tokens. Tokens can be words, sentences, or even individual characters; in word tokenization, they specifically mean words. Basically, blocks of text are broken down into individual words or word-level tokens. For example, the sentence “I love my dog” is tokenized into the vector [“I”, “love”, “my”, “dog”]. The R language provides various packages and functions to perform tokenization, each with its own set of features and capabilities.
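
As a quick illustration of the idea, the example above can be reproduced in base R with ‘strsplit()’; a minimal sketch that splits on single spaces only:

# Minimal base-R sketch: split the example sentence on spaces
sentence <- "I love my dog"
tokens <- unlist(strsplit(sentence, " "))
print(tokens)
# [1] "I"    "love" "my"   "dog"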

When to use Word Tokenization

Word Tokenization is mostly used in NLP tasks like feature extraction, sentiment analysis, machine translation, text classification, chatbots, and text summarization. It is used wherever the individual words in a text matter. For example, in sentiment analysis, tokenization identifies the individual words in a text so that their sentiment can be analyzed; the words ‘happy’ and ‘sad’ would receive different sentiment scores. In machine translation, the text is broken down into words so that they can be translated into another language. Chatbots use word tokenization to understand the meaning of user input and to generate appropriate output. Word tokenization also allows us to extract words as features for machine learning and statistical models, and these words can be used in building search engines.
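
For instance, once a sentence is tokenized, each word can be looked up in a sentiment lexicon; the sketch below uses a tiny made-up lexicon purely for illustration:

# Toy sketch: score words against a tiny hypothetical sentiment lexicon
tokens <- unlist(strsplit(tolower("I am happy but she is sad"), " "))
lexicon <- c(happy = 1, sad = -1)     # invented scores for illustration
scores <- lexicon[tokens]             # NA for words not in the lexicon
sum(scores, na.rm = TRUE)             # net sentiment: 0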



Word tokenization is one of the crucial steps in text processing, as it helps in the normalization and standardization of the text data by splitting it into words.

General Steps for Tokenization

The general steps to perform tokenization in R are:

  1. Install and Load Packages: Relevant packages, such as ‘tokenizers’, ‘tm’, or ‘quanteda’, must be installed and loaded.
  2. Load or Import Data: The text data that requires tokenization must be loaded. This could be in the form of a file, a data frame, or a text corpus.
  3. Preprocess the Text: Optional steps such as lowercasing and removing punctuation can be applied here.
  4. Tokenization: The chosen tokenization method is used to break the text down into tokens.
  5. Analysis: Various text analysis tasks can now be performed on the tokens, as sketched below.
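
A compact sketch of these five steps, assuming the ‘tokenizers’ package is already installed:

library(tokenizers)                                           # 1. load the package
text <- "Tokens matter. Tokens make text analysis possible."  # 2. load the data
text <- tolower(text)                                         # 3. preprocess (optional)
tokens <- unlist(tokenize_words(text))                        # 4. tokenize
print(sort(table(tokens), decreasing = TRUE))                 # 5. analyze: word frequencies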

There are several ways to perform word tokenization in R, and several packages provide functions that assist with it.

Word Tokenization in R using the ‘tokenizers’ package

The ‘tokenizers’ package provides functions such as ‘tokenize_words()’ for word tokenization. It is versatile and easy to use for breaking text into words, which makes it useful for extracting features from text data and performing word-level analysis.




# Install and load the 'tokenizers' package
install.packages("tokenizers")
library(tokenizers)
 
# Sample text
text <- "Welcome to Geeks for Geeks.Embark on an extraordinary coding odyssey with our
groundbreaking course,
DSA to Development - Complete Coding Guide! Discover the transformative power of mastering
Data Structures and Algorithms
(DSA) as you venture towards becoming a Proficient Developer."
 
# Tokenize the text into words
word_tokens <- unlist(tokenize_words(text))
 
# Print the result
print(word_tokens)

Output:

[1] "welcome" "to"      "geeks"   "for"     "geeks"  "embark"
[7] "on" "an" "extraordinary" "coding" "odyssey" "with" "our" 
[14] "groundbreaking" "course" "dsa" "to" "development" "complete" "coding"
[21] "guide" "discover" "the" "transformative" "power" "of" "mastering"
[28] "data" "structures" "and" "algorithms" "as" "you" "venture" "towards"
[35] "becoming" "a" "proficient" "developer"

‘tokenize_words()’ function

tokenize_words(x, lowercase = TRUE, stopwords = NULL, strip_punct = TRUE, strip_numeric = FALSE, simplify = FALSE)

Tokenization with Stopword, Punctuation, and Number Removal




# Load the 'tokenizers' package
library(tokenizers)

# Sample text
text <- "welcome to GFG !@# 23"

# Tokenize while removing stopwords, punctuation, and numeric tokens
word_tokens <- unlist(tokenize_words(text, lowercase = TRUE, stopwords = c("to"),
                                     strip_punct = TRUE, strip_numeric = TRUE,
                                     simplify = FALSE))

# Print the result
print(word_tokens)

Output:

[1] "welcome" "gfg"

Key features and functions of the ‘tokenizers’ package

  1. A Family of Tokenizers: Besides ‘tokenize_words()’, the package offers ‘tokenize_sentences()’, ‘tokenize_characters()’, and ‘tokenize_ngrams()’ for other token types (a bigram sketch follows this list).
  2. Built-in Cleaning: Arguments such as ‘lowercase’, ‘stopwords’, ‘strip_punct’, and ‘strip_numeric’ let tokenization and basic preprocessing happen in a single call.
  3. Consistent Output: Each tokenizer accepts a character vector and returns a list with one element per input document, which makes the results easy to combine with ‘unlist()’ and ‘lapply()’.
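
Since the introduction mentions n-grams, here is a small sketch of the package’s ‘tokenize_ngrams()’ function producing word bigrams:

# Word bigrams with tokenize_ngrams() (lowercases by default)
library(tokenizers)
tokenize_ngrams("I love my dog", n = 2)
# [[1]]
# [1] "i love"  "love my" "my dog"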

Word Tokenization in R using the ‘tm’ package

The ‘tm’ package is primarily designed for text mining but also supports word tokenization, for example through the ‘words()’ function used below. It provides functions for preprocessing, exploring, and analyzing text data.




# Install and load the 'tm' package
install.packages("tm")
library(tm)
 
# Sample text
text <- "Welcome to Geeks for Geeks, this is the best platform for articles."
 
# Create a Corpus
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
 
# Tokenize the cleaned documents into words using the words() function
words <- unlist(sapply(corpus, words))
 
# Print the result
print(words)

Output:

1,] "welcome" 
[2,] "geeks"   
[3,] "geeks"   
[4,] "best"    
[5,] "platform"
[6,] "artices"

Key features and functions of the ‘tm’ package

  1. Corpus Management: ‘Corpus()’ and ‘VectorSource()’ organize raw text into a corpus, the central data structure of the package.
  2. Preprocessing Pipeline: ‘tm_map()’ applies transformations such as ‘content_transformer(tolower)’, ‘removePunctuation’, ‘removeNumbers’, and ‘removeWords’ across the whole corpus.
  3. Term Matrices: ‘TermDocumentMatrix()’ and ‘DocumentTermMatrix()’ turn a preprocessed corpus into matrices for frequency analysis and modeling (a short sketch follows this list).
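
To illustrate point 3, a minimal sketch (document contents invented for illustration) that builds a term-document matrix from a two-document corpus:

# Build a term-document matrix from a small two-document corpus
library(tm)
docs <- Corpus(VectorSource(c("geeks love coding", "geeks write articles")))
tdm <- TermDocumentMatrix(docs)
inspect(tdm)   # displays term counts per document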

Word Tokenization in R Using the ‘quanteda’ package

This package is known for its flexibility in text analysis and provides the ‘tokens()’ function for word tokenization.




# Install and load the 'quanteda' package
install.packages("quanteda")
library(quanteda)
 
# Sample text
text <- "Geeks for Geeks."
 
# Tokenize the text into words, dropping punctuation
word_tokens <- tokens(text, remove_punct = TRUE)
 
# Print the result
print(word_tokens)

Output:

[1] "geeks" "for"   "geeks"

‘tokens()’ function

tokens(x, what = "word", remove_punct = FALSE, remove_symbols = FALSE, remove_numbers = FALSE, remove_url = FALSE, remove_separators = TRUE, split_hyphens = FALSE, ...)

Key features and functions of the ‘quanteda’ package

  1. Flexible Text Analysis: Quanteda is a powerful and flexible package for text analysis, allowing us to perform a wide range of text mining and natural language processing tasks. It offers features for text cleaning, tokenization, feature selection, and document modeling, making it suitable for a variety of text analysis projects.
  2. Document-Feature Matrix (DFM): Quanteda provides tools for creating and working with Document-Feature Matrices (DFMs), which are a fundamental data structure for text analysis. DFMs enable us to represent text documents in a numeric format, making it easy to perform statistical and machine learning analyses on text data (see the sketch after this list).
  3. Customization: Quanteda allows for extensive customization through a variety of options and functions. We can tailor our text analysis workflows to specific needs, whether that means stemming, stopword removal, or other preprocessing steps.
  4. NLP and Linguistic Analysis: Quanteda supports advanced natural language processing tasks such as part-of-speech tagging, named entity recognition, collocation analysis, and sentiment analysis. This makes it a valuable tool for more in-depth linguistic and text analysis research.
  5. Community and Documentation: Quanteda has an active and supportive user community. It offers comprehensive documentation, tutorials, and examples, making it accessible for both beginners and experienced users in the field of text analysis.
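
Following up on point 2, a document-feature matrix can be built by passing the output of ‘tokens()’ to ‘dfm()’; a minimal sketch with invented example documents:

# Tokens -> document-feature matrix with dfm()
library(quanteda)
txts <- c(d1 = "Geeks for Geeks", d2 = "text analysis for geeks")
m <- dfm(tokens(txts))
print(m)   # rows are documents, columns are word features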

Word Tokenization in R Using the ‘strsplit()’ function

The ‘strsplit()’ function in R is used to split the elements of a character vector into substrings based on specified delimiters or regular expressions.




# Sample character vector
text_vector <- c("apple,banana,cherry", "dog,cat,elephant", "red,green,blue")
 
# Split the elements of the character vector using a comma as the delimiter
split_text <- lapply(text_vector, function(x) unlist(strsplit(x, ",")))
 
# Print the result
print(split_text)

Output:

[[1]]
[1] "apple"  "banana" "cherry"

[[2]]
[1] "dog"      "cat"       "elephant"

[[3]]
[1] "red"   "green" "blue"

‘strsplit()’ function

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)




# Sample character vector
text_vector <- c("apple,banana,cherry", "dog-cat:elephant", "red-green|blue")
 
# Split the elements of the character vector using regular expressions and different parameters
split_text <- lapply(
  text_vector,
  function(x) {
    # Split using regular expression: , or - or : or |
    unlist(strsplit(x, "[,|:-]", fixed = FALSE, perl = FALSE, useBytes = FALSE))
  }
)
 
# Print the result
print(split_text)

Output:

[[1]]
[1] "apple"  "banana" "cherry"

[[2]]
[1] "dog"      "cat"       "elephant"

[[3]]
[1] "red"   "green" "blue"

Word tokenization is a fundamental step in working with textual data. Which method to use depends on the complexity of the data and one's specific needs. Each of these packages offers a user-friendly way to perform tokenization.

