Open In App

Split comma-separated strings in a column into separate rows

Last Updated : 25 Apr, 2024
Improve
Improve
Like Article
Like
Save
Share
Report

Splitting comma-separated strings in a column into separate rows is a common task in data manipulation and analysis in R Programming Language. This transformation is useful when dealing with data where multiple values are concatenated within a single cell, and you want to separate them into distinct rows for further analysis or visualization.

This article explores various approaches to achieve this task, focusing on tidyr, dplyr, and base R methods. Let’s start with an example data frame that contains comma-separated strings in a specific column:

R
# Create a sample data frame
df <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Aliyana", "Boby", "Charlie"),
  Skills = c("R,Python,SQL", "Excel,Tableau", "Java,C++,Python")
)
print(df)

Output:

  ID    Name          Skills
1  1 Aliyana    R,Python,SQL
2  2    Boby   Excel,Tableau
3  3 Charlie Java,C++,Python

In this example, the Skills column contains comma-separated strings representing the skills of each person. We aim to split these strings into separate rows.

Splitting with tidyr separate_rows

The separate_rows function from the tidyr package is designed for this purpose. It can split a column with delimiter-separated values into individual rows. Using separate_rows to Split Comma-Separated Strings

R
# Load tidyr package
library(tidyr)

# Create a sample data frame
df <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Aliyana", "Boby", "Charlie"),
  Skills = c("R,Python,SQL", "Excel,Tableau", "Java,C++,Python")
)
print(df)
# Split the Skills column into separate rows
df_long <- separate_rows(df, Skills, sep = ",")
print(df_long)

Output:

  ID    Name          Skills
1  1 Aliyana    R,Python,SQL
2  2    Boby   Excel,Tableau
3  3 Charlie Java,C++,Python

# A tibble: 8 × 3
     ID Name    Skills 
  <dbl> <chr>   <chr>  
1     1 Aliyana R      
2     1 Aliyana Python 
3     1 Aliyana SQL    
4     2 Boby    Excel  
5     2 Boby    Tableau
6     3 Charlie Java   
7     3 Charlie C++    
8     3 Charlie Python

This code splits the Skills column into separate rows based on the comma delimiter, resulting in a data frame where each skill is in a separate row.

Splitting with dplyr and stringr

The combination of dplyr and stringr packages can also be used to achieve this task. You can first use stringr::str_split to split the strings and then use tidyr::unnest to expand the list into separate rows.

R
# Load dplyr and stringr packages
library(dplyr)
library(stringr)

# Create a sample data frame
df <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Aliyana", "Boby", "Charlie"),
  Skills = c("R,Python,SQL", "Excel,Tableau", "Java,C++,Python")
)
print(df)
# Split the Skills column and unnest the results
df_long <- df %>%
  mutate(Skills = str_split(Skills, ",")) %>%
  unnest(Skills)
print(df_long)

Output:

  ID    Name          Skills
1  1 Aliyana    R,Python,SQL
2  2    Boby   Excel,Tableau
3  3 Charlie Java,C++,Python

# A tibble: 8 × 3
     ID Name    Skills 
  <dbl> <chr>   <chr>  
1     1 Aliyana R      
2     1 Aliyana Python 
3     1 Aliyana SQL    
4     2 Boby    Excel  
5     2 Boby    Tableau
6     3 Charlie Java   
7     3 Charlie C++    
8     3 Charlie Python

This code first splits the Skills column into a list of individual skills and then uses unnest to create separate rows for each skill.

Splitting with Base R

Base R provides a more manual approach for splitting comma-separated strings into separate rows. This approach involves using strsplit, rep, and unlist.

R
# Create a sample data frame
df <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Aliyana", "Boby", "Charlie"),
  Skills = c("R,Python,SQL", "Excel,Tableau", "Java,C++,Python")
)
print(df)

# Split the Skills column using strsplit
skills_split <- strsplit(df$Skills, ",")
# Create a new data frame with repeated rows based on the length of split lists
df_long <- df[rep(seq_len(nrow(df)), sapply(skills_split, length)), ]
# Assign the expanded skills to the new data frame
df_long$Skills <- unlist(skills_split)
print(df_long)

Output:

  ID    Name          Skills
1 1 Aliyana R,Python,SQL
2 2 Boby Excel,Tableau
3 3 Charlie Java,C++,Python

ID Name Skills
1 1 Aliyana R
1.1 1 Aliyana Python
1.2 1 Aliyana SQL
2 2 Boby Excel
2.1 2 Boby Tableau
3 3 Charlie Java
3.1 3 Charlie C++
3.2 3 Charlie Python

This code splits the Skills column into lists, repeats the original data frame based on the length of the split lists, and then assigns the expanded skills to the new data frame.

Conclusion

Splitting comma-separated strings in a column into separate rows is a useful data transformation technique in R. This article presented three approaches to accomplish this task: using tidyr::separate_rows, the combination of dplyr and stringr, and a base R approach. Depending on your preferred method and requirements, you can select the most appropriate technique for your needs.



Like Article
Suggest improvement
Previous
Next
Share your thoughts in the comments

Similar Reads