R is a programming language for statistical computing, widely used by data miners and statisticians for developing statistical software and for data analysis. R and its libraries implement a wide variety of statistical and graphical techniques, including linear and non-linear modeling, classical statistical tests, time-series analysis, classification, clustering, machine learning, and statistical inference.
Any value written inside quotes is treated as a string in R. A string is a sequence of characters, and it can be stored in a variable. R displays every string within double quotes, even when it was created with single quotes.
Text Processing in R
Method 1: Using Built-in Type
In this method, we use R's built-in character type for text processing.
Variable_name <- "String"
R
a <- "hello world"
print(a)
Output:
[1] "hello world"
Following is a list of rules that need to be followed while working with strings:
- The quotes at the beginning and end of a string must both be double quotes or both be single quotes. They cannot be mixed.
- Double quotes can be inserted into a string starting and ending with a single quote.
- A single quote can be inserted into a string starting and ending with double-quotes.
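The quoting rules above can be tried out with a small sketch. The variable names are illustrative only, and the backslash escape in the third assignment is standard R syntax that the rules above do not mention.

```r
# A single quote inside a double-quoted string
s1 <- "It's a string"
# Double quotes inside a single-quoted string
s2 <- 'She said "hi"'
# The same quote character as the delimiter must be escaped with a backslash
s3 <- "She said \"hi\""

print(s1)
cat(s2, "\n")              # cat() writes the raw characters, without escapes
print(identical(s2, s3))   # both spellings produce the same string
```

Mixing delimiters, as in "hello', leaves the string unterminated and is a syntax error.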
String Manipulation
String manipulation means processing a given string to read or change its contents. R provides several functions for this:
- Concatenating strings – paste() function: This function combines strings in R. It accepts any number of arguments.
Syntax: paste(..., sep = " ", collapse = NULL)
Parameters:
- ...: One or more values (or vectors) to combine.
- sep: The separator placed between the arguments. It is optional and defaults to a single space.
- collapse: An optional string used to join the elements of the result into a single string. With the default NULL, the result stays a vector with one element per combined value.
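The difference between sep and collapse is easiest to see with a vector input. The following is a minimal sketch with made-up data: sep goes between the arguments of each pasted element, while collapse joins the resulting elements into one string.

```r
words <- c("red", "green", "blue")

# sep separates the arguments; the result still has three elements
labelled <- paste("colour", words, sep = ": ")
print(labelled)

# collapse joins the elements of the result into a single string
joined <- paste(words, collapse = ", ")
print(joined)
```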
R
str1 <- "hello"
str2 <- "how are you?"
print(paste(str1, str2, sep = " ", collapse = NULL))
Output:
[1] "hello how are you?"
- Formatting numbers and strings – format() function: This function formats strings and numbers in a specified style.
Syntax: format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
Parameters:
- x is the vector input.
- digits is the minimum number of significant digits to display.
- nsmall is the minimum number of digits to the right of the decimal point.
- scientific is set to TRUE to display scientific notation.
- width is the minimum width of the result; shorter values are padded with blanks (numbers on the left, strings according to justify).
- justify controls whether strings are aligned to the left, right, or centre.
R
result <- format(69.145656789, digits = 9)
print(result)
result <- format(c(3, 132.84521), scientific = TRUE)
print(result)
result <- format(96.47, nsmall = 5)
print(result)
result <- format(8)
print(result)
result <- format(67.7, width = 6)
print(result)
result <- format("Hello", width = 8, justify = "left")
print(result)
Output:
[1] "69.1456568"
[1] "3.000000e+00" "1.328452e+02"
[1] "96.47000"
[1] "8"
[1] "  67.7"
[1] "Hello   "
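As a further illustration, format() is often applied to whole vectors to line up columns of text. This is a minimal sketch with hypothetical data (items and prices are made-up names), not a complete report formatter.

```r
items  <- c("tea", "coffee", "milk")
prices <- c(2.5, 13, 0.75)

# Strings are padded on the right by default (justify = "left")
left  <- format(items, width = 8)
# justify = "right" pads on the left instead
right <- format(items, width = 8, justify = "right")
# nsmall = 2 forces two decimals; all elements share a common width
money <- format(prices, nsmall = 2)

print(paste(left, money))
```

Because format() returns character vectors of equal width, the pasted lines align when printed one per row.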
- Counting the number of characters in the string – nchar() function: This function counts the characters in a string, including spaces.
Syntax: nchar(x)
Parameter:
- x is the vector input here.
R
a <- nchar("hello world")
print(a)
Output:
[1] 11
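Since x is a vector, nchar() can count several strings in one call. A short sketch with made-up input:

```r
words <- c("R", "text processing", "")

# One count per element; spaces are counted, the empty string has length 0
lengths <- nchar(words)
print(lengths)
```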
- Changing the case of the string – toupper() & tolower() functions: These functions change the case of a string.
Syntax: toupper(x) and tolower(x)
Parameter:
- x is the character vector input.
R
a <- toupper("hello world")
print(a)
b <- tolower("HELLO WORLD")
print(b)
Output:
[1] "HELLO WORLD"
[1] "hello world"
- Extracting parts of the string – substring() function: This function is used to extract parts of the string.
Syntax: substring(x, first, last)
Parameters:
- x is the character vector input.
- first is the position of the first character to be extracted.
- last is the position of the last character to be extracted.
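The first and last arguments can themselves be vectors, and last may be omitted to read to the end of the string. The sketch below assumes nothing beyond base R; the variable names are illustrative.

```r
word <- "Programming"

# Omitting last extracts from position 8 to the end of the string
tail_part <- substring(word, 8)
print(tail_part)

# Vector arguments give one substring per (first, last) pair
windows <- substring(word, 1:3, 3:5)
print(windows)
```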
R
part <- substring("Programming", 1, 3)
print(part)
Output:
[1] "Pro"
Method 2: Using the Tidyverse
In this method, we use the tidyverse, a collection of packages covering the data science workflow from data exploration to data visualization. It includes stringr, a package designed for working with strings whose functions are useful for data cleaning and data preparation.
We are using this text for processing:
R
string <- c("WelcometoGeeksforgeeks!")
Example 1: Detect the string
In this example, we check whether a pattern occurs in the string using the str_detect() function.
Syntax: str_detect(string, pattern)
Parameters:
- string is the input vector.
- pattern is the text (or regular expression) to look for.
R
library(tidyverse)
str_detect(string, "geeks")
Output:
[1] TRUE
Example 2: Locate the string
In this example, we find the position of a pattern in the string using the str_locate() function.
Syntax: str_locate(string, pattern)
Parameters:
- string is the input vector.
- pattern is the text (or regular expression) to look for.
R
library(tidyverse)
str_locate(string, "geeks")
Output:
     start end
[1,]    18  22
Example 3: Extract the string
In this example, we extract the first match of a pattern from the string using the str_extract() function.
Syntax: str_extract(string, pattern)
Parameters:
- string is the input vector.
- pattern is the text (or regular expression) to extract.
R
library(tidyverse)
str_extract(string, "for")
Output:
[1] "for"
Example 4: Replace the string
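In this example, we replace the first match of a pattern in the string using the str_replace() function.
Syntax: str_replace(string, pattern, replacement)
Parameters:
- string is the input vector.
- pattern is the text (or regular expression) to look for.
- replacement is the text that replaces the match.
R
library(tidyverse)
str_replace(string, "toGeeksforgeeks", " geeks")
Output:
[1] "Welcome geeks!"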
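For comparison, the four stringr calls above have close base R counterparts. The sketch below is one possible mapping, not the only one; it uses fixed = TRUE to treat the patterns as literal text, and the variable names are illustrative.

```r
string <- "WelcometoGeeksforgeeks!"

# Detect: does the pattern occur? (compare str_detect)
found <- grepl("geeks", string, fixed = TRUE)

# Locate: start and end of the first match (compare str_locate)
pos   <- regexpr("geeks", string, fixed = TRUE)
start <- as.integer(pos)
end   <- start + attr(pos, "match.length") - 1L

# Extract: the text of the first match (compare str_extract)
piece <- regmatches(string, regexpr("for", string, fixed = TRUE))

# Replace: substitute the first match (compare str_replace)
replaced <- sub("toGeeksforgeeks", " geeks", string, fixed = TRUE)

print(c(found, start, end))
print(piece)
print(replaced)
```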
Method 3: Using regex and external module
In this method, we use regular expressions (regex) with an external package, stringr.
Example 1: Select the character using dot
Here we use the dot (.), which matches any single character, to select characters within the string.
R
library(stringr)
string <- c("WelcometoGeeksforgeeks!")
str_extract_all(string, "G..k")
Output:
[[1]]
[1] "Geek"
Example 2: Select the string using \\D
\\D matches any single character that is not a digit.
R
str_extract_all(string, "W\\D\\Dcome")
Output:
[[1]]
[1] "Welcome"
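To make the contrast between \\D (non-digit) and its counterpart \\d (digit) concrete, here is a minimal sketch on a made-up sentence. It uses base R functions so that it runs without extra packages; the same patterns can be passed to str_extract_all().

```r
x <- "Order 66 shipped on day 7"

# \\d+ matches runs of one or more digits
numbers <- regmatches(x, gregexpr("\\d+", x))[[1]]
print(numbers)

# Removing every non-digit (\\D) leaves only the digits
digits_only <- gsub("\\D", "", x)
print(digits_only)

# Removing every digit (\\d) leaves the rest of the text
no_digits <- gsub("\\d", "", x)
print(no_digits)
```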
Method 4: Using grep()
grep() returns the indices of the elements of a vector in which the pattern is found. If several elements match, it returns a vector of their indices. This is useful because it tells us not only whether the pattern occurs but also where it occurs in the vector.
Syntax: grep(pattern, x, ignore.case = FALSE)
Parameters:
- pattern: A regular expression pattern.
- x: The character vector to be searched.
- ignore.case: Whether to ignore case in the search. It is optional and defaults to FALSE.
Example 1: Find the elements of a vector that contain a specific word.
R
str <- c("Hello", "hello", "hi", "hey")
grep("hey", str)
Output:
[1] 4
Example 2: Find the elements of a vector that contain a specific word, irrespective of case.
R
str <- c("Hello", "hello", "hi", "hey")
grep("he", str, ignore.case = TRUE)
Output:
[1] 1 2 4
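Two related base R options are worth knowing alongside the index form: value = TRUE returns the matching elements themselves, and grepl() returns one logical per element. A short sketch on the same data (words is an illustrative name):

```r
words <- c("Hello", "hello", "hi", "hey")

# Indices of the matching elements
idx <- grep("he", words, ignore.case = TRUE)

# The matching elements themselves
vals <- grep("he", words, ignore.case = TRUE, value = TRUE)

# One TRUE/FALSE per element; ^h anchors a lowercase h at the start
flags <- grepl("^h", words)

print(idx)
print(vals)
print(flags)
```

The logical vector from grepl() is convenient for subsetting, for example words[flags].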
Last Updated :
24 Nov, 2023