Working with Text in R

R is a programming language for statistical computing, widely used by statisticians and data miners for data analysis and for developing statistical software. R and its packages implement a wide variety of statistical and graphical techniques, including linear and non-linear modeling, classical statistical tests, time-series analysis, classification, and clustering, as well as machine learning algorithms and statistical inference.

Any value written inside single or double quotes is treated as a string in R. A string is a sequence of characters, stored as an element of a character vector that can be assigned to a variable. R prints strings with double quotes by default, even when they were created with single quotes.

Text Processing in R

Method 1: Using Built-in Type

In this method, we are using a built-in type for text processing in R.

Variable_name <- "String"

R




# R program to demonstrate
# creation of a string
a <- "hello world"
print(a)


Output:

[1] "hello world"

The following rules apply when working with strings:

  • The quotes at the beginning and end of a string must both be double quotes or both be single quotes. They cannot be mixed.
  • Double quotes can be inserted into a string that starts and ends with single quotes.
  • A single quote can be inserted into a string that starts and ends with double quotes.
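
The quoting rules above can be tried out with a short base R sketch. The variable names s1, s2 and s3 are illustrative only; the third string uses a backslash escape, which is another way to place a double quote inside a double-quoted string.

```r
# Double quotes inside a single-quoted string
s1 <- 'She said "hi"'
# A single quote inside a double-quoted string
s2 <- "It's fine"
# A backslash escape lets a string contain its own delimiter
s3 <- "She said \"hi\""

# print() shows the escaped form, cat() shows the raw characters
print(s1)
cat(s1, "\n")

# s1 and s3 hold exactly the same characters
print(identical(s1, s3))
```

Whichever quote style is used to create the string, the stored value is the same, which is why identical() reports TRUE here.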

String Manipulation

String manipulation means processing a given string to read or change its contents. R provides several functions for this:

  • Concatenating strings – paste() function: This function is used to combine strings in R. It accepts any number of arguments to combine.

Syntax: paste(..., sep = " ", collapse = NULL)

Parameters: 

  • ...: One or more arguments (vectors) to combine.
  • sep: The separator placed between the arguments. It is optional and defaults to a single space.
  • collapse: An optional string used to join the elements of the result into a single string. With the default NULL, the result keeps one element per input element.
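
The difference between sep and collapse is easiest to see with vector arguments. The following is a minimal base R sketch; the names words, by_sep and collapsed are illustrative.

```r
words <- c("a", "b", "c")

# sep is placed between the arguments, element by element
by_sep <- paste(words, 1:3, sep = "-")
print(by_sep)     # a character vector with three elements

# collapse then joins those elements into one string
collapsed <- paste(words, 1:3, sep = "-", collapse = "+")
print(collapsed)  # a single string
```

Without collapse the result has as many elements as the longest argument; with collapse it always has length one.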

R




# concatenate two strings
str1 <- "hello"
str2 <- "how are you?"
print(paste(str1, str2, sep = " ", collapse = NULL))


Output: 

[1] "hello how are you?"

  • Formatting numbers and strings – format() function: This function formats numbers and strings in a specified style and returns the result as a character vector.

Syntax: format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none")) 

Parameters:

  • x is the vector input.
  • digits is the number of significant digits to display (not the number of decimal places).
  • nsmall is the minimum number of digits to the right of the decimal point.
  • scientific is set to TRUE to display scientific notation.
  • width is the minimum width of the result; shorter values are padded with blanks (at the beginning for numbers).
  • justify controls whether strings are aligned to the left, right, or centre; it does not apply to numbers.

R




# formatting numbers and strings
 
# Number of significant digits displayed.
# The last digit is rounded.
result <- format(69.145656789, digits=9)
print(result)
 
# Display numbers in scientific notation.
result <- format(c(3, 132.84521),
                  scientific=TRUE)
print(result)
 
# The minimum number of digits
# to the right of the decimal point.
result <- format(96.47, nsmall=5)
print(result)
 
# Format treats everything as a string.
result <- format(8)
print(result)
 
# Numbers are padded with blank
# in the beginning for width.
result <- format(67.7, width=6)
print(result)
 
# Left justify strings.
result <- format("Hello", width=8,
                  justify="l")
print(result)


Output: 

[1] "69.1456568"
[1] "3.000000e+00" "1.328452e+02"
[1] "96.47000"
[1] "8"
[1] "  67.7"
[1] "Hello   "
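
format() is vectorised, and when given several values it pads them to a common width. A small base R sketch of that behaviour follows; the names nums and labels are illustrative.

```r
# Numbers: every element is padded to the width of the widest one
nums <- format(c(1, 10, 100))
print(nums)

# Strings: justify decides on which side the padding goes
labels <- format(c("R", "text"), width = 6, justify = "right")
print(labels)
```

This common-width padding is what makes format() handy for lining up columns in printed reports.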

  • Counting the number of characters in the string – nchar() function: This function counts the characters in a string. Spaces count as characters.

Syntax: nchar(x)

Parameter: 

  • x is the vector input here.

R




# to count the number of characters
# in the string
a <- nchar("hello world")
print(a)


Output: 

[1] 11
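
Because x is a vector, nchar() returns one count per element. A minimal base R sketch, with the illustrative names words and counts:

```r
words <- c("hello", "hello world", "")

# One count per element; the empty string has zero characters
counts <- nchar(words)
print(counts)
```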

  • Changing the case of the string – toupper() & tolower() functions: These functions are used to change the case of the string. 

Syntax: toupper(x) and tolower(x)

Parameter:

  • x is the vector input

R




# Changing to Upper case.
a <- toupper("hello world")
print(a)
 
# Changing to lower case.
b <- tolower("HELLO WORLD")
print(b)


Output: 

[1] "HELLO WORLD"
[1] "hello world"

  • Extracting parts of the string – substring() function: This function is used to extract parts of the string.

Syntax: substring(x, first, last)

Parameters: 

  • x is the character vector input.
  • first is the position of the first character to be extracted.
  • last is the position of the last character to be extracted.
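
As a rough illustration of how first and last behave, the base R sketch below omits last in one call (it then runs to the end of the string) and passes vectors of positions in another. The names x, tail_part and pieces are illustrative.

```r
x <- "Programming"

# When last is omitted, extraction runs to the end of the string
tail_part <- substring(x, 8)
print(tail_part)

# first and last are vectorised, so several pieces can be cut in one call
pieces <- substring(x, c(1, 4, 8), c(3, 7, 11))
print(pieces)
```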

R




# Extract characters from the 1st to the 3rd position.
c <- substring("Programming", 1, 3)
print(c)


Output: 

[1] "Pro"

Method 2: Using the tidyverse (stringr)

In this method, we use the tidyverse, a collection of packages that covers the data science workflow from data exploration to data visualization. One of these packages, stringr, is designed for working with strings and provides many functions for data cleaning and data preparation. Loading the tidyverse with library(tidyverse) also loads stringr.

We are using this text for processing:

R




string <- c("WelcometoGeeksforgeeks!")


Example 1: Detect the string

In this example, we will detect the string using str_detect() method.

Syntax: str_detect(string, pattern)

Parameters:

  • string is the vector input
  • pattern is the text (or regular expression) to look for

R




library(tidyverse)
 
str_detect(string, "geeks")


Output:

[1] TRUE

Example 2: Locate the string

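In this example, we will locate the string using the str_locate() method, which returns the start and end positions of the first match.

Syntax: str_locate(string, pattern)

Parameters:

  • string is the vector input
  • pattern is the text (or regular expression) to look for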

R




library(tidyverse)
 
str_locate(string, "geeks")


Output:

     start end
[1,]    18  22

Example 3: Extract the string

In this example, we will extract the matching text using the str_extract() method.

Syntax: str_extract(string, pattern)

Parameters:

  • string is the vector input
  • pattern is the text (or regular expression) to look for

R




library(tidyverse)
 
str_extract(string, "for")


Output:

[1] "for"

Example 4: Replace the string

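In this example, we will replace part of the string using the str_replace() method.

Syntax: str_replace(string, pattern, replacement)

Parameters:

  • string is the vector input
  • pattern is the text (or regular expression) to look for
  • replacement is the text that replaces the first match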

R




library(tidyverse)
 
str_replace(string, "toGeeksforgeeks", " geeks")


Output:

[1] "Welcome geeks!"
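
str_replace() changes only the first match in each string, while stringr's str_replace_all() changes every match. The sketch below assumes the stringr package is installed; the names first_only and every_one are illustrative.

```r
library(stringr)

string <- "WelcometoGeeksforgeeks!"

# Only the first "e" is replaced
first_only <- str_replace(string, "e", "E")
print(first_only)

# Every "e" is replaced
every_one <- str_replace_all(string, "e", "E")
print(every_one)
```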

Method 3: Using regex and external module

In this method, we use regular expressions (regex) with functions from an external package such as stringr.

Example 1: Match any character using a dot

Here we use a dot (.), which matches any single character, to match part of the string.

R




string <- c("WelcometoGeeksforgeeks!")
 
str_extract_all(string, "G..k")


Output:

[[1]]
[1] "Geek"

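Example 2: Match non-digit characters using \\D

\\D matches any single character that is not a digit; it is the opposite of \\d, which matches a digit. Inside an R string the backslash must be doubled.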

R




str_extract_all(string, "W\\D\\Dcome")


Output:

[[1]]
[1] "Welcome"
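
To see how \\D relates to its counterpart \\d, here is a short sketch using base R functions (grepl() and gsub()) rather than stringr, so it runs without extra packages. The vector x and the result names are illustrative.

```r
x <- c("abc", "a1c", "123")

# \\d matches a digit: TRUE where the element contains at least one digit
has_digit <- grepl("\\d", x)
print(has_digit)

# ^\\D+$ matches only elements made up entirely of non-digits
only_non_digits <- grepl("^\\D+$", x)
print(only_non_digits)

# Remove every digit from each element
digits_removed <- gsub("\\d", "", x)
print(digits_removed)
```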

Method 4: Using grep()

The grep() function searches each element of a character vector for a pattern and returns the indices of the elements that contain a match. If several elements match, it returns an integer vector with all of their indices. This tells us not only whether the pattern occurs but also where in the vector it occurs.

Syntax: grep(pattern, x, ignore.case = FALSE)

Parameters: 

  • pattern: A regular expression pattern.
  • x: The character vector to be searched.
  • ignore.case: Whether to ignore case in the search. It is optional and is set to FALSE by default.

Example 1: To find all elements of the vector that contain a specific word.

R




str <- c("Hello", "hello", "hi", "hey")
grep('hey', str)


Output:

[1] 4

Example 2: To find all elements of the vector that contain a specific pattern, irrespective of case

R




str <- c("Hello", "hello", "hi", "hey")
grep('he', str, ignore.case = TRUE)


Output:

[1] 1 2 4
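
When the matching elements themselves are wanted rather than their positions, grep() accepts value = TRUE, and the related base R function grepl() returns one logical per element. A minimal sketch, using the illustrative vector name greetings:

```r
greetings <- c("Hello", "hello", "hi", "hey")

# Indices of the matching elements
idx <- grep("he", greetings, ignore.case = TRUE)
print(idx)

# The matching elements themselves
vals <- grep("he", greetings, ignore.case = TRUE, value = TRUE)
print(vals)

# One TRUE/FALSE per element, convenient for subsetting
flags <- grepl("he", greetings, ignore.case = TRUE)
print(flags)
```

The logical vector from grepl() has the same length as the input, so it can be used directly as greetings[flags].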



Last Updated : 24 Nov, 2023