Open In App

Data Cleaning & Transformation with Dplyr in R

Last Updated : 15 Nov, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

In R, data formatting typically involves preparing and structuring your data in a way that is suitable for analysis or visualization. The exact steps for data formatting may vary depending on your specific dataset and the analysis you want to perform. Here are some common data formatting tasks in the R Programming Language.

Formatting data in R is an essential part of data preprocessing and analysis. Depending on our specific needs, we want to manipulate data types, change the structure of our data frame, or format variables. Here are some common tasks related to data formatting in R:

Import Data

Use functions like read.csv(), read.table(), read.xlsx() from packages like readr, readxl, or others to import our data into R. Ensure that our data is in a supported file format such as CSV, Excel, or a database.

Checking Data Structure

Use str() to check the structure of your data frame. This function displays the data type and structure of each column in your data frame.

str(your_data_frame)

Changing Data Types

Convert columns to appropriate data types using functions like as.numeric(), as.character(), as.Date(), etc. For example, to convert a column to a numeric type.

# Convert a numeric variable to character
df$numeric_var <- as.character(df$numeric_var)
# Convert a character variable to numeric
df$character_var <- as.numeric(df$character_var)
# Convert a variable to a factor (categorical)
df$factor_var <- as.factor(df$factor_var)

Handling Missing Values

Use functions like is.na(), complete.cases(), or na.omit() to identify and handle missing values (NA) appropriately. You can choose to remove rows with missing values or impute missing values with mean, median, or other strategies.

# Remove rows with missing values
your_data_frame <- your_data_frame[complete.cases(your_data_frame), ]
# Impute missing values with mean
your_data_frame$numeric_column[is.na(your_data_frame$numeric_column)] <- mean
(your_data_frame$numeric_column, na.rm = TRUE)

Renaming Columns

Rename columns using the colnames() function or by directly assigning new column names to the names() attribute of your data frame.

# Rename a column
colnames(your_data_frame)[colnames(your_data_frame) == "old_column_name"] <- "new_column_name"

Merging and Joining Data

Use functions like merge(), join(), or dplyr functions like left_join(), inner_join(), etc., to combine data from multiple sources based on common keys.

# Merge two data frames by a common key
merged_data <- merge(data_frame1, data_frame2, by = "common_key")

Formatting Dates and Times

Use functions like as.Date(), as.POSIXct(), or format() to format date and time columns as needed.

your_data_frame$date_column <- as.Date(your_data_frame$date_column, format = "%Y-%m-%d")

Aggregating Data

Use functions like aggregate() or dplyr functions like group_by() and summarize() to aggregate data based on specific variables.

library(dplyr)
aggregated_data <- your_data_frame %>%
group_by(category_column) %>%
summarize(mean_value = mean(numeric_column))

Remember that data formatting is often a crucial step in data analysis and can significantly impact the accuracy and success of your analyses and visualizations. The specific formatting steps you need to take depend on your dataset and analysis objectives.

Let’s go through some common data formatting tasks

Selecting Columns

We can select specific columns of the dataset using the $ operator or the subset() function.

R




# load the required package
library(dplyr)
 
# Select specific columns using the $ operator
selected_data <- mtcars$mpg
 
# Select specific columns using subset()
selected_data <- subset(mtcars, select = c(mpg, hp))
 
head(selected_data)


Output:

                   mpg  hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
Hornet 4 Drive 21.4 110
Hornet Sportabout 18.7 175
Valiant 18.1 105

Filtering Rows

We can filter rows based on certain conditions using the subset() or [ ] indexing.

R




# Filter rows where mpg is greater than 20
filtered_data <- subset(mtcars, mpg > 20)
 
# Alternatively, you can use [ ] indexing
filtered_data <- mtcars[mtcars$mpg > 20, ]
 
filtered_data


Output:

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Renaming Columns

We can rename columns using the colnames() function.

R




# Rename the "mpg" column to "MilesPerGallon"
colnames(mtcars)[colnames(mtcars) == "mpg"] <- "MilesPerGallon"
names(mtcars)


Output:

 [1] "MilesPerGallon" "cyl"            "disp"           "hp"             "drat"          
[6] "wt" "qsec" "vs" "am" "gear"
[11] "carb"

Sorting Data

You can sort the data by one or more columns using the order() function.

R




# Sort the data by mpg in descending order
sorted_data <- mtcars[order(-mtcars$MilesPerGallon), ]
 
head(sorted_data)


Output:

               MilesPerGallon cyl  disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2

Creating New Variables

We can create new variables based on existing ones.

R




# Create a new variable "MilesPerGallonCategory" based on "mpg"
mtcars$MilesPerGallonCategory <- ifelse(mtcars$mpg > 20, "High", "Low")
 
mtcars$MilesPerGallonCategory


Output:

 [1] "High" "High" "High" "High" "Low"  "Low"  "Low"  "High" "High" "Low"  "Low"  "Low"  "Low" 
[14] "Low" "Low" "Low" "Low" "High" "High" "High" "High" "Low" "Low" "Low" "Low" "High"
[27] "High" "High" "Low" "Low" "Low" "High"

Aggregating Data

We can summarize data using aggregation functions like mean(), sum(), etc.

R




# Calculate the mean mpg for each number of cylinders
mean_mpg_by_cyl <- aggregate(mtcars$MilesPerGallon, by = list(mtcars$cyl),
                             FUN = mean)
colnames(mean_mpg_by_cyl) <- c("Cylinders", "MeanMPG")
 
mean_mpg_by_cyl


Output:

  Cylinders  MeanMPG
1 4 26.66364
2 6 19.74286
3 8 15.10000

Merging Data

We can merge datasets using functions like merge().

R




# Create a second dataset for merging
df2 <- data.frame(Car = rownames(mtcars), Price = runif(nrow(mtcars),
                                          min = 10000, max = 50000))
# Merge the two datasets by the "Car" column
merged_data <- merge(mtcars, df2, by.x = "row.names", by.y = "Car")
 
head(merged_data)


Output:

           Row.names  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 AMC Javelin 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2
2 Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
3 Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
4 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
5 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
6 Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2
MilesPerGallonCategory Price
1 Low 49797.46
2 Low 35230.94
3 Low 14315.48
4 Low 36384.85
5 High 27144.52
6 Low 10147.37

These are some basic data formatting and manipulation tasks we can perform in R using built-in functions and datasets like mtcars. Depending on your specific data analysis needs, you may also want to explore packages like dplyr and tidyr for more advanced data manipulation tasks.



Like Article
Suggest improvement
Share your thoughts in the comments

Similar Reads