Data Cleaning & Transformation with Dplyr in R

Last Updated : 15 Nov, 2023

In R, data formatting typically involves preparing and structuring your data in a way that is suitable for analysis or visualization. The exact steps for data formatting may vary depending on your specific dataset and the analysis you want to perform. Here are some common data formatting tasks in the R Programming Language.

Formatting data in R is an essential part of data preprocessing and analysis. Depending on our specific needs, we want to manipulate data types, change the structure of our data frame, or format variables. Here are some common tasks related to data formatting in R:

Import Data

Use functions like read.csv(), read.table(), read.xlsx() from packages like readr, readxl, or others to import our data into R. Ensure that our data is in a supported file format such as CSV, Excel, or a database.

Checking Data Structure

Use str() to check the structure of your data frame. This function displays the data type and structure of each column in your data frame.

str(your_data_frame)

Changing Data Types

Convert columns to appropriate data types using functions like as.numeric(), as.character(), as.Date(), etc. For example, to convert a column to a numeric type.

# Convert a numeric variable to character
df$numeric_var <- as.character(df$numeric_var)
# Convert a character variable to numeric
df$character_var <- as.numeric(df$character_var)
# Convert a variable to a factor (categorical)
df$factor_var <- as.factor(df$factor_var)

Handling Missing Values

Use functions like is.na(), complete.cases(), or na.omit() to identify and handle missing values (NA) appropriately. You can choose to remove rows with missing values or impute missing values with mean, median, or other strategies.

# Remove rows with missing values
your_data_frame <- your_data_frame[complete.cases(your_data_frame), ]
# Impute missing values with mean
your_data_frame$numeric_column[is.na(your_data_frame$numeric_column)] <- mean
(your_data_frame$numeric_column, na.rm = TRUE)

Renaming Columns

Rename columns using the colnames() function or by directly assigning new column names to the names() attribute of your data frame.

# Rename a column
colnames(your_data_frame)[colnames(your_data_frame) == "old_column_name"] <- "new_column_name"

Merging and Joining Data

Use functions like merge(), join(), or dplyr functions like left_join(), inner_join(), etc., to combine data from multiple sources based on common keys.

# Merge two data frames by a common key
merged_data <- merge(data_frame1, data_frame2, by = "common_key")

Formatting Dates and Times

Use functions like as.Date(), as.POSIXct(), or format() to format date and time columns as needed.

your_data_frame$date_column <- as.Date(your_data_frame$date_column, format = "%Y-%m-%d")

Aggregating Data

Use functions like aggregate() or dplyr functions like group_by() and summarize() to aggregate data based on specific variables.

library(dplyr)
aggregated_data <- your_data_frame %>%
  group_by(category_column) %>%
  summarize(mean_value = mean(numeric_column))

Remember that data formatting is often a crucial step in data analysis and can significantly impact the accuracy and success of your analyses and visualizations. The specific formatting steps you need to take depend on your dataset and analysis objectives.

Let’s go through some common data formatting tasks

Selecting Columns

We can select specific columns of the dataset using the $ operator or the subset() function.

R

# load the required package
library(dplyr)
 
# Select specific columns using the $ operator
selected_data <- mtcars$mpg
 
# Select specific columns using subset()
selected_data <- subset(mtcars, select = c(mpg, hp))
 
head(selected_data)

Output:

                   mpg  hp
Mazda RX4         21.0 110
Mazda RX4 Wag     21.0 110
Datsun 710        22.8  93
Hornet 4 Drive    21.4 110
Hornet Sportabout 18.7 175
Valiant           18.1 105

Filtering Rows

We can filter rows based on certain conditions using the subset() or [ ] indexing.

R

# Filter rows where mpg is greater than 20
filtered_data <- subset(mtcars, mpg > 20)
 
# Alternatively, you can use [ ] indexing
filtered_data <- mtcars[mtcars$mpg > 20, ]
 
filtered_data

Output:

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Renaming Columns

We can rename columns using the colnames() function.

R

# Rename the "mpg" column to "MilesPerGallon"
colnames(mtcars)[colnames(mtcars) == "mpg"] <- "MilesPerGallon"
names(mtcars)

Output:

 [1] "MilesPerGallon" "cyl"            "disp"           "hp"             "drat"          
 [6] "wt"             "qsec"           "vs"             "am"             "gear"          
[11] "carb"

Sorting Data

You can sort the data by one or more columns using the order() function.

R

# Sort the data by mpg in descending order
sorted_data <- mtcars[order(-mtcars$MilesPerGallon), ]
 
head(sorted_data)

Output:

               MilesPerGallon cyl  disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla           33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat 128                 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic              30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Lotus Europa             30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Fiat X1-9                27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2            26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

Creating New Variables

We can create new variables based on existing ones.

R

# Create a new variable "MilesPerGallonCategory" based on "mpg"
mtcars$MilesPerGallonCategory <- ifelse(mtcars$mpg > 20, "High", "Low")
 
mtcars$MilesPerGallonCategory

Output:

 [1] "High" "High" "High" "High" "Low"  "Low"  "Low"  "High" "High" "Low"  "Low"  "Low"  "Low" 
[14] "Low"  "Low"  "Low"  "Low"  "High" "High" "High" "High" "Low"  "Low"  "Low"  "Low"  "High"
[27] "High" "High" "Low"  "Low"  "Low"  "High"

Aggregating Data

We can summarize data using aggregation functions like mean(), sum(), etc.

R

# Calculate the mean mpg for each number of cylinders
mean_mpg_by_cyl <- aggregate(mtcars$MilesPerGallon, by = list(mtcars$cyl), 
                             FUN = mean)
colnames(mean_mpg_by_cyl) <- c("Cylinders", "MeanMPG")
 
mean_mpg_by_cyl

Output:

  Cylinders  MeanMPG
1         4 26.66364
2         6 19.74286
3         8 15.10000

Merging Data

We can merge datasets using functions like merge().

R

# Create a second dataset for merging
df2 <- data.frame(Car = rownames(mtcars), Price = runif(nrow(mtcars), 
                                          min = 10000, max = 50000))
# Merge the two datasets by the "Car" column
merged_data <- merge(mtcars, df2, by.x = "row.names", by.y = "Car")
 
head(merged_data)

Output:

           Row.names  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1        AMC Javelin 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
2 Cadillac Fleetwood 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
3         Camaro Z28 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
4  Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
5         Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
6   Dodge Challenger 15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
  MilesPerGallonCategory    Price
1                    Low 49797.46
2                    Low 35230.94
3                    Low 14315.48
4                    Low 36384.85
5                   High 27144.52
6                    Low 10147.37

These are some basic data formatting and manipulation tasks we can perform in R using built-in functions and datasets like mtcars. Depending on your specific data analysis needs, you may also want to explore packages like dplyr and tidyr for more advanced data manipulation tasks.

Suggest improvement

Data Wrangling in R Programming - Data Transformation

Share your thoughts in the comments

Data Cleaning & Transformation with Dplyr in R

Import Data

Checking Data Structure

Changing Data Types

Handling Missing Values

Renaming Columns

Merging and Joining Data

Formatting Dates and Times

Aggregating Data

Selecting Columns

R

Filtering Rows

R

Renaming Columns

R

Sorting Data

R

Creating New Variables

R

Aggregating Data

R

Merging Data

R

Please Login to comment...

Similar Reads

What kind of Experience do you want to share?