Word Tokenization Using R

There is a need for a method to encode raw text as integers so that computation may be done on it to construct features for Supervised Machine Learning from natural language. Tokenization is usually the initial stage in this process of going from natural language to feature, or any type of text analysis. It is essential to understand the concepts of tokenization and tokens, as well as the associated idea of an n-gram, for practically every natural language processing activity.

Tokens are small units; they can be of various types:

Characters
Words
n-grams
Sentences
Paragraphs
Lines

Word Tokenization in R

Word Tokenization is a fundamental task in Natural Language Processing (NLP) and text analysis tasks. It is one of the crucial steps in NLP tasks that involves breaking down a text into smaller units, called tokens. These tokens can be words, sentences, or even individual characters. In word tokenization, tokens specifically mean words. Basically, blocks of text are broken down into individual words or word-level tokens. For example, the sentence “I love my dog” will be tokenized into the vector [“I”, “love”, “my”, “dog”]. R language provides various packages and functions to perform tokenization, each with its own set of features and capabilities.

When to use Word Tokenization

Word Tokenization is mostly used in NLP tasks like feature extraction sentiment analysis, machine translation, text classification, chatbots, text summarization, etc. It is used in places where individual words in a text are important. For example in Sentiment analysis, tokenization is used to identify the individual words in a text so that their sentiment can be analyzed; the words ‘happy’ and ‘sad’ would have different sentiment scores. In Machine translation, the text is broken down to words so that they can be translated into another language. Chatbots use word tokenization to understand the meaning of user input and to generate appropriate output. Word tokenization also allows us to extract words as features that can be further used in machine learning and statistical models, these words can also be used in building search engines.

Word tokenization is one of the crucial steps in text processing, as it helps in the normalization and standardization of the text data by splitting it into words.

Tips for Tokenization

Using a single type of tokenizer will make it easier to compare and combine the results of different tasks.
Considering language-specific tokenizer makes it more efficient.
Normalizing punctuation before tokenizing, will ensure that different punctuation symbols to be treated consistently.
Word tokenization tools are helpful, but are not perfect. Sometimes compound words are split into multiple tokens.

General steps

General steps to perform tokenization in R:

Install and Load Packages: Relevant packages, such as ‘tokenizers’, ‘tm’, or ‘quanteda’, must be installed and loaded.
Load or Import Data: The text data that requires tokenization must be loaded. This could be in the form of a file, a data frame, or a text corpus.
Text Preprocess: Lower casing, removing punctuation, and other types of text processing can be done. It is optional.
Tokenization: The chosen tokenization method is used to break down the text into tokens.
Analysis: Now various text analysis tasks can be done on the tokens.

There are several ways to perform word tokenization in R. There are several packages in R that have functions that assist in tokenization.

Word Tokenization in R using the ‘tokenizers’ package

The package provides functions such as ‘tokenize_words()’ for word tokenization. This is versatile and easy to use for breaking text into words. This is useful for extracting features from text data and performing word-level analysis.

# Install and load the 'tokenizers' package

install.packages("tokenizers")

library(tokenizers)
 
# Sample text

text <- "Welcome to Geeks for Geeks.Embark on an extraordinary coding odyssey with our 
groundbreaking course,
DSA to Development - Complete Coding Guide! Discover the transformative power of mastering
Data Structures and Algorithms
(DSA) as you venture towards becoming a Proficient Developer."
 
# Tokenize the text into words

word_tokens <- unlist(tokenize_words(text))
 
# Print the result

print(word_tokens)

Output:

[1] "welcome" "to"      "geeks"   "for"     "geeks"  "embark"
[7] "on" "an" "extraordinary" "coding" "odyssey" "with" "our" 
[14] "groundbreaking" "course" "dsa" "to" "development" "complete" "coding"
[21] "guide" "discover" "the" "transformative" "power" "of" "mastering"
[28] "data" "structures" "and" "algorithms" "as" "you" "venture" "towards"
[35] "becoming" "a" "proficient" "developer"

‘x'(requried): The input text that must be tokenized.
‘lowercase'(optional): It converts all words to lowercase. Default is set to TRUE.
‘stopwords'(optional): A character vector of stop words to be excluded. The words can be given by us.
‘strip_punct'(optional): It removes all punctuation. Default is set to FALSE, thus won’t remove any punctuation.
‘strip_numeric'(optional): It removes all the numbers. Default is FALSE, can be set to TRUE if there is no need of numbers.
‘Simplify'(optional): By default it is set to FALSE. If TRUE, then an input with a single element will return a character vector of tokens instead of a list.

‘tokenize_words()’ function

tokenize_words(x,lowercase=TRUE,stopwords=NULL,strip_punct=TRUE,strip_numeric=FALSE)

Tokenization with stopwords and Numbers Removal

#loading tokenizer package 

library(tokenizers)
#sample text

text<-"welcome to GFG !@# 23"
#tokenization and stopwords , punctuations and  numric tokens removal

word_tokens<-unlist(tokenize_words(text,lowercase = TRUE,stopwords = ("to"),

                           strip_punct = TRUE, strip_numeric = TRUE,simplify = FALSE))
#printing the results

print(word_tokens)

Output:

[1] "welcome" "gfg"

Here, ‘stopwords = (“to”)’ shows that the word ‘to’ must be excluded.
The ‘unlist’ function is then used to flatten this list into a single character vector of words.

Key features and functions of ‘tokenizers’ package

Sentence Tokenization: It provides ‘tokenize_sentences()’ function to split text into sentences. Useful for analysing text at sentence level.
N-grams: N-grams is sequence of n words that can be useful for capturing contextual information in text. The ‘tokenize_ngrams()’ function helps in creating n-grams from text.
Multi-lingual Support: ‘tokenizers’ can handle text in multiple languages and character encodings, making it suitable for a wide range of linguistic tasks.
It offers flexibility in tokenization patterns, making it suitable for different languages and text data formats.
The ability to tokenize text into words, sentences, or n-grams provides with a wide range of options for text analysis and feature extraction.

Word Tokenization in R using Using the ‘tm’ package:

This package has ‘wordTokenize()’ function. It is primarily designed for text mining but also includes functions for word tokenization. It has functions for preprocessing, exploring, and analyzing text data.

# Install and load the 'tm' package

install.packages("tm")

library(tm)
 
# Sample text

text <- "Welcome to Geeks for Geeks, this is the best platform for articles."
 
# Create a Corpus

corpus <- Corpus(VectorSource(text))

corpus <- tm_map(corpus, content_transformer(tolower))

corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, removeNumbers)

corpus <- tm_map(corpus, removeWords, stopwords("english"))
 
# Tokenize the text into words using the tokenizers package

words <- unlist(sapply(corpus, words))
 
# Print the result

print(words)

Output:

1,] "welcome" 
[2,] "geeks"   
[3,] "geeks"   
[4,] "best"    
[5,] "platform"
[6,] "artices"

‘corpus <- Corpus(VectorSource(text))’: This line creates a text corpus named ‘corpus’ using the ‘Corpus’ function from the ‘tm’ package. A corpus is a collection of text documents, and in this case, a corpus with a single document, which is the ‘text’.
‘corpus <- tm_map(corpus, content_transformer(tolower))’: Here, transformation is applied to the corpus to convert all text to lowercase. This step ensures that the text is uniform in terms of capitalization.
‘corpus <- tm_map(corpus, removePunctuation)’: This line removes punctuation from the text in the corpus. Punctuation marks are typically not considered as part of words, so they are removed as a preprocessing step.
‘corpus <- tm_map(corpus, removeNumbers)’: This step removes numerical digits from the text.
‘corpus <- tm_map(corpus, removeWords, stopwords(“english”))’: Here, the common English stop words from the text are removed. Stop words are words that are often removed from text for analysis because they are very common and may not carry much meaning. The `stopwords(“english”)` function provides a list of common English stop words.
‘words <- unlist(sapply(corpus, words))’: Finally, the individual words from the preprocessed corpus are extracted. The ‘sapply’ function applies the ‘words’ function to each document in the corpus, resulting in a list of words. The ‘unlist’ function is then used to flatten this list into a single character vector of words.

Key features and functions of ‘tm’ package

Word Tokenization: As mentioned above ‘wordTokenize()’ function helps in word tokenization.
Text Corpus Creation: The ‘tm’ package helps in text corpus creation. A corpua is a collection of text document.
Text Cleaning and Preprocessing: ‘tm’ provides a wide range of text preprocessing functions.
Text lowercase conversion.
Removal of stopwords (common words like ‘the’,’ and’,’ in’).
Stemming and lemmatization(reducing words to their base or root form eg.: the words ‘running’, ‘swimming’ will be reduced to ‘run’ and ‘swim’ respectively.
Punctuation and special characters removal.
Integration with NLP Packages: ‘tm’ can be integrated with other NLP packages and tools to perform more advanced text processing and analysis.
Text Mining Operations: The package supports various text mining operations.

Word Tokenization in R Using the ‘quanteda’ package

This package is known for its flexibility in text analysis and provides a ‘tokenize_words()’ function for word tokenization.

# Install and load the 'quanteda' package

install.packages("quanteda")

library(quanteda)
 
# Sample text

text <- "Geeks for Geeks."
 
# Tokenize the text into words

word_tokens <- tokenize_words(text)
 
# Print the result

print(word_tokens)

Output:

[1] "geeks" "for"   "geeks"

‘tokenize_word()’ function

tokenize_word(x, split_hyphens = FALSE, verbose = TRUE)

‘x ‘(required): This is the input text that must be tokenize into words. It should be a character vector, and the function will tokenize each element in the vector.
‘split_hyphens’ (optional): This parameter controls whether the function should split words that are connected by hyphens or hyphenation-like characters. If ‘split_hyphens’ is set to TRUE, these hyphenated words will be separated into individual tokens. If it’s set to FALSE, the hyphenated words will be kept as a single token.
‘verbose’ (optional): If verbose is set to TRUE, the function will print timing messages to the console, which can be helpful for monitoring the tokenization process, especially when working with large texts.

Key features and functions of ‘tm’ package

Flexible Text Analysis: Quanteda is a powerful and flexible package for text analysis, allowing us to perform a wide range of text mining and natural language processing tasks. It offers features for text cleaning, tokenization, feature selection, and document modeling, making it suitable for a variety of text analysis projects.
Document-Feature Matrix (DFM): Quanteda provides tools for creating and working with Document-Feature Matrices (DFMs), which are a fundamental data structure for text analysis. DFMs enable us to represent text documents in a numeric format, making it easy to perform statistical and machine learning analyses on text data.
Customization: Quanteda allows for extensive customization through a variety of options and functions. We can tailor your text analysis workflows to your specific needs, whether it’s stemming, stopword removal.
NLP and Linguistic Analysis: Quanteda supports advanced natural language processing tasks such as part-of-speech tagging, named entity recognition, collocation analysis, and sentiment analysis. This makes it a valuable tool for more in-depth linguistic and text analysis research.
Community and Documentation: Quanteda has an active and supportive user community. It offers comprehensive documentation, tutorials, and examples, making it accessible for both beginners and experienced users in the field of text analysis.

Word Tokenization in R Using the ‘strsplit()’ function

The ‘strsplit()’ function in R is used to split the elements of a character vector into substrings based on specified delimiters or regular expressions.

# Sample character vector

text_vector <- c("apple,banana,cherry", "dog,cat,elephant", "red,green,blue")
 
# Split the elements of the character vector using a comma as the delimiter

split_text <- lapply(text_vector, function(x) unlist(strsplit(x, ",")))
 
# Print the result

print(split_text)

Output:

[[1]]
[1] "apple"  "banana" "cherry"

[[2]]
[1] "dog"      "cat"       "elephant"

[[3]]
[1] "red"   "green" "blue"

The ‘lapply()’ function is used to apply a function to each element of the text_vector.
The function used in ‘lapply()’ is an anonymous function that takes each element x of the text_vector.
Within the anonymous function:
-> ‘strsplit(x, “,”)’ is used to split the string x into substrings wherever a comma , is encountered. This function returns a list of character vectors, where each element of the list corresponds to the split parts of the input string.
-> ‘unlist()’ is used to convert this list of character vectors into a single character vector. It flattens the list and removes the list structure.

‘strsplit()’ function:

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

‘x’ (required): This is a character vector where each element represents the text that must be split into substrings.
‘split’ (required): A character vector containing regular expressions (unless fixed = TRUE) that shows how the splitting should occur. If empty matches are found in the input text x will be split into single characters. If split has a length greater than 1, it is recycled along x.
‘fixed’ (optional): A logical parameter. If it is set to TRUE, it matches split exactly without assuming it as a regular expression. It takes precedence over perl, when set to TRUE.
‘perl’ (optional): A logical parameter. It determines whether Perl-compatible regular expressions (these are a powerful and widely used pattern matching and text processing feature found in many programming languages, including R) should be used. If perl is TRUE, it takes precedence over split.
‘useBytes’ (optional): A logical parameter. If set to TRUE, matching is done byte-by-byte rather than character-by-character.

# Sample character vector

text_vector <- c("apple,banana,cherry", "dog-cat:elephant", "red-green|blue")
 
# Split the elements of the character vector using regular expressions and different parameters

split_text <- lapply(

  text_vector,

  function(x) {

    # Split using regular expression: , or - or : or |

    unlist(strsplit(x, "[,|:-]", fixed = FALSE, perl = FALSE, useBytes = FALSE))

  }
)
 
# Print the result

print(split_text)

Output:

[[1]]
[1] "apple"  "banana" "cherry"

[[2]]
[1] "dog"      "cat"       "elephant"

[[3]]
[1] "red"   "green" "blue"

‘strsplit()’ function is used to split each element using a regular expression pattern [,|:-]. This pattern specifies that we want to split the string wherever a comma (,), hyphen (-), colon (:), or pipe (|) is encountered.
fixed = FALSE: this allows the use of regular expressions to define the splitting pattern.
perl = FALSE: this do not use Perl-compatible regular expressions.
useBytes = FALSE: this do not perform matching byte-by-byte.
In this output, each element of the list represents the split parts of the corresponding input string, split using the regular expression pattern ‘[,|:-]’. The various delimiters in the input strings were correctly used to split the text into substrings, demonstrating the use of different parameters of the ‘strsplit()’ function.

Word tokenization is a fundamental step in working with textual data. The method that can be used depends on the complexity of the data and ones specific need. Each of these packages offers a user-friendly way to do tokenization.

Article Tags :

Geeks Premier League

R Language

Geeks Premier League 2023