Basic Operations using dplyr

Last Updated : 30 Apr, 2024

The dplyr package in R Programming Language is a powerful toolkit for data manipulation. It simplifies common tasks such as filtering, selecting, mutating, summarizing, and joining datasets. Here we explore the core functions of dplyr and show how they can be used to manipulate and transform data effectively.

What is dplyr?

dplyr is an R package developed by Hadley Wickham, known for its simplicity and effectiveness in data manipulation tasks. It provides a set of functions that streamline common data manipulation operations, making data wrangling more efficient and less error-prone.

Features of dplyr

  1. Simplifies data manipulation tasks with an intuitive syntax.
  2. Optimized for speed and efficiency, making it ideal for large datasets.
  3. Integrates seamlessly with other packages like ggplot2 and tidyr for comprehensive data analysis.

Step 1: Load the Required Libraries and Dataset

R
# Load the dplyr package
library(dplyr)

# Preview the first few rows of the mtcars dataset
head(mtcars)

Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

1. Filtering Rows

We can filter rows based on specific conditions using the filter() function. Suppose we want to keep only the cars that get more than 30 miles per gallon (mpg).

R
filtered_cars <- filter(mtcars, mpg > 30)
print(filtered_cars)

Output:

                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
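
filter() also accepts several conditions at once. The following is a minimal sketch, not part of the original example: the variable names light_efficient and four_six_or_powerful are made up for illustration. Conditions separated by commas must all hold, while | expresses "or" and %in% matches against a set of values.

```r
library(dplyr)

# Comma-separated conditions are combined with AND:
# cars above 30 mpg that also weigh less than 2 (wt is in 1000 lbs)
light_efficient <- filter(mtcars, mpg > 30, wt < 2)
print(light_efficient)

# Use | for OR and %in% to match any value from a set:
# cars with 4 or 6 cylinders, or with more than 300 horsepower
four_six_or_powerful <- filter(mtcars, cyl %in% c(4, 6) | hp > 300)
print(nrow(four_six_or_powerful))
```

The Fiat 128 from the earlier result is dropped by the first call because its weight (2.200) does not satisfy wt < 2.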

2. Selecting Columns

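To select specific columns of interest, we use the select() function. Here we select only the miles per gallon (mpg) column. Note that mtcars has no model column: the car model names are stored as row names, so they appear in the output alongside mpg.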

R
# Selecting columns
selected_cars <- select(mtcars, mpg)
print(selected_cars)

Output (first 13 of 32 rows shown):

                     mpg
Mazda RX4           21.0
Mazda RX4 Wag       21.0
Datsun 710          22.8
Hornet 4 Drive      21.4
Hornet Sportabout   18.7
Valiant             18.1
Duster 360          14.3
Merc 240D           24.4
Merc 230            22.8
Merc 280            19.2
Merc 280C           17.8
Merc 450SE          16.4
Merc 450SL          17.3
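
If the model name is needed as a real column, one option is to convert the row names first. The sketch below assumes the tibble package is available (it is installed as a dependency of dplyr) and uses its rownames_to_column() function. The column name model and the variable names are chosen for illustration only.

```r
library(dplyr)

# Turn the row names into an ordinary column called "model"
cars_with_model <- tibble::rownames_to_column(mtcars, var = "model")

# Both columns can now be selected by name
model_mpg <- select(cars_with_model, model, mpg)
print(head(model_mpg, 3))

# select() also accepts column ranges and helpers such as starts_with()
range_and_helper <- select(mtcars, mpg:disp, starts_with("d"))
print(names(range_and_helper))
```

The second call keeps mpg, cyl, and disp from the range and adds drat through the helper. A column matched twice, such as disp, is only included once.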

3. Mutating Data

With mutate(), we can create new columns based on existing ones. Now create a new column called hp_per_cyl representing the horsepower per cylinder.

R
mutated_cars <- mutate(mtcars, hp_per_cyl = hp / cyl)
print(mutated_cars)

Output (first 7 of 32 rows shown):

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb hp_per_cyl
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   18.33333
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   18.33333
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   23.25000
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   18.33333
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   21.87500
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   17.50000
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   30.62500
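
A single mutate() call can create several columns, and later expressions may use columns created earlier in the same call. This is a small sketch: the column names wt_lbs, hp_per_lb, and efficiency, and the 25 mpg threshold, are invented for illustration.

```r
library(dplyr)

cars_extra <- mutate(
  mtcars,
  wt_lbs = wt * 1000,                                  # wt is recorded in units of 1000 lbs
  hp_per_lb = hp / wt_lbs,                             # uses wt_lbs created on the line above
  efficiency = if_else(mpg > 25, "high", "standard")   # conditional label
)

print(head(select(cars_extra, wt, wt_lbs, hp_per_lb, efficiency), 3))
```

if_else() is dplyr's stricter variant of base ifelse(): it requires both outcomes to have the same type.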

4. Arranging Rows

To sort rows based on a specific column, we use arrange(). Let’s arrange the cars by descending order of horsepower (hp).

R
arranged_cars <- arrange(mtcars, desc(hp))
print(arranged_cars)

Output (first 7 of 32 rows shown):

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
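
arrange() can sort by more than one column, with later columns breaking ties in earlier ones. As a brief sketch (the variable name sorted_cars is only illustrative), the following sorts by cylinder count in ascending order and, within each cylinder count, by mpg in descending order.

```r
library(dplyr)

# Ascending by cyl, then descending by mpg within each cyl value
sorted_cars <- arrange(mtcars, cyl, desc(mpg))

print(head(select(sorted_cars, cyl, mpg), 5))
```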

5. Summarizing Data

summarize() is used to compute summary statistics. Calculate the mean and median of miles per gallon (mpg).

R
summary_stats <- summarize(mtcars, mean_mpg = mean(mpg), median_mpg = median(mpg))
print(summary_stats)

Output:

  mean_mpg median_mpg
1 20.09062       19.2
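
summarize() works with any function that reduces a column to a single value. The sketch below adds a row count with dplyr's n() together with the minimum, maximum, and standard deviation of mpg. The result column names are chosen for illustration.

```r
library(dplyr)

summary_more <- summarize(
  mtcars,
  n_cars  = n(),        # number of rows
  min_mpg = min(mpg),
  max_mpg = max(mpg),
  sd_mpg  = sd(mpg)
)

print(summary_more)
```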

6. Grouping Data

When performing aggregate operations by groups, we use group_by(). Group the cars by number of cylinders (cyl) and compute the mean miles per gallon (mpg) within each group.

R
grouped_cars <- group_by(mtcars, cyl)
summary_stats_by_cyl <- summarize(grouped_cars, mean_mpg = mean(mpg))
print(summary_stats_by_cyl)

Output:

# A tibble: 3 × 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4     26.7
2     6     19.7
3     8     15.1
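
Grouping also works with more than one variable. The following sketch groups by cylinders (cyl) and transmission type (am) and computes a count and a mean per group. It assumes dplyr 1.0.0 or later, where summarize() accepts the .groups argument. The variable names are illustrative.

```r
library(dplyr)

# One group per (cyl, am) combination present in the data
by_cyl_am <- group_by(mtcars, cyl, am)

stats_by_cyl_am <- summarize(
  by_cyl_am,
  n_cars   = n(),
  mean_mpg = mean(mpg),
  .groups  = "drop"     # return an ungrouped result (assumes dplyr >= 1.0.0)
)
print(stats_by_cyl_am)

# count() is a shortcut for group_by() followed by summarize(n = n())
cars_per_cyl <- count(mtcars, cyl)
print(cars_per_cyl)
```

Dropping the grouping at the end avoids surprises when the result is passed to later steps that would otherwise still operate group by group.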

7. Chaining (%>% operator)

Suppose we perform multiple data manipulation tasks in sequence, like filtering, then arranging, then summarizing. Instead of storing the intermediate results in variables, we can “chain” the functions together using the %>% operator. It passes the result of one function as the first argument to the next function.

R
mtcars %>%
  filter(mpg > 20) %>%
  arrange(desc(hp)) %>%
  summarize(mean_mpg = mean(mpg))

Output:

  mean_mpg
1 25.47857

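The code above first filters cars with mpg greater than 20, then arranges them by descending horsepower, and finally computes the mean mpg, all in one pipeline. The sorting step does not change the mean; it is included only to show how several steps chain together.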
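
To see what the pipe saves, the same computation can be written with nested calls or with intermediate variables. This is a sketch for comparison only: the variable names are made up, and all three forms are expected to give the same result.

```r
library(dplyr)

# Nested calls: must be read from the inside out
nested <- summarize(
  arrange(filter(mtcars, mpg > 20), desc(hp)),
  mean_mpg = mean(mpg)
)

# Intermediate variables: readable, but clutters the workspace
step1 <- filter(mtcars, mpg > 20)
step2 <- arrange(step1, desc(hp))
stepwise <- summarize(step2, mean_mpg = mean(mpg))

# Piped: reads top to bottom in the order the steps run
piped <- mtcars %>%
  filter(mpg > 20) %>%
  arrange(desc(hp)) %>%
  summarize(mean_mpg = mean(mpg))

print(all.equal(nested, piped))
print(all.equal(stepwise, piped))
```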

Joining with dplyr

Joins are used to combine data from multiple tables based on common columns. dplyr provides several functions for performing different types of joins, including inner joins, left joins, right joins, and full joins.

  1. Inner Join (inner_join()): Returns only the rows that have matching values in both tables.
  2. Left Join (left_join()): Returns all rows from the left table and matching rows from the right table. If there are no matching rows in the right table, NA values are filled in.
  3. Right Join (right_join()): Returns all rows from the right table and matching rows from the left table. If there are no matching rows in the left table, NA values are filled in.
  4. Full Join (full_join()): Returns all rows from both tables, combining them where there are matching values and filling in NA values where there are no matches.

R
library(dplyr)

# Create two datasets
df1 <- data.frame(ID = c(1, 2, 3),
                  Name = c("Alice", "Bob", "Charlie"))

df2 <- data.frame(ID = c(2, 3, 4),
                  Age = c(25, 30, 35))

# Inner join
inner_result <- inner_join(df1, df2, by = "ID")

# Left join
left_result <- left_join(df1, df2, by = "ID")

# Right join
right_result <- right_join(df1, df2, by = "ID")

# Full join
full_result <- full_join(df1, df2, by = "ID")

# Print all results with headers
print("Inner Join Result:")
print(inner_result)

print("Left Join Result:")
print(left_result)

print("Right Join Result:")
print(right_result)

print("Full Join Result:")
print(full_result)

Output:

[1] "Inner Join Result:"
  ID    Name Age
1  2     Bob  25
2  3 Charlie  30

[1] "Left Join Result:"
  ID    Name Age
1  1   Alice  NA
2  2     Bob  25
3  3 Charlie  30

[1] "Right Join Result:"
  ID    Name Age
1  2     Bob  25
2  3 Charlie  30
3  4    <NA>  35

[1] "Full Join Result:"
  ID    Name Age
1  1   Alice  NA
2  2     Bob  25
3  3 Charlie  30
4  4    <NA>  35
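
In practice the key columns often have different names in the two tables. The sketch below uses hypothetical data frames people and ages, where the key is called ID in one and PersonID in the other, and passes a named vector to by to pair them up. It also compares row counts across join types as a quick sanity check.

```r
library(dplyr)

# Hypothetical tables whose key columns are named differently
people <- data.frame(ID = c(1, 2, 3),
                     Name = c("Alice", "Bob", "Charlie"))
ages <- data.frame(PersonID = c(2, 3, 4),
                   Age = c(25, 30, 35))

# Named vector: left-table column name = right-table column name
left_named <- left_join(people, ages, by = c("ID" = "PersonID"))
print(left_named)

# Row counts show how each join type treats unmatched keys
join_sizes <- c(
  inner = nrow(inner_join(people, ages, by = c("ID" = "PersonID"))),
  left  = nrow(left_join(people, ages, by = c("ID" = "PersonID"))),
  full  = nrow(full_join(people, ages, by = c("ID" = "PersonID")))
)
print(join_sizes)
```

The key column in the result keeps the name from the left table (ID).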

Advantages of dplyr package

  1. Readable Syntax: dplyr employs a straightforward grammar, making code more readable and easier to understand, even for those new to R.
  2. Efficiency: dplyr functions are implemented with performance in mind and are often faster than equivalent base R code, especially for large datasets.
  3. Integration with Other Packages: dplyr seamlessly integrates with other popular R packages, such as ggplot2 for visualization and tidyr for data tidying, forming a powerful ecosystem for data analysis.
  4. Consistency: The consistent naming conventions and function behaviors across dplyr functions simplify the learning curve and enhance code maintainability.

Conclusion

The dplyr package is a handy toolkit for data manipulation in R. With its easy-to-understand functions, we can make our data analysis smoother, faster, and more insightful. By trying out different functions with sample datasets, we can become proficient at using dplyr to handle all sorts of data tasks.

FAQs

What is data manipulation?

Data manipulation involves organizing, cleaning, and transforming data to make it useful for analysis.

Why do we need to manipulate data?

Manipulating data helps us prepare it for analysis by removing errors, handling missing values, and restructuring it to extract meaningful insights.

How do I optimize code for data manipulation in R?

Optimize your code by using efficient functions, minimizing memory usage, and breaking down complex tasks into smaller steps.

Can I handle large datasets in R?

Yes, you can handle large datasets in R using packages like `data.table` or `disk.frame` that are optimized for performance and memory usage.

How do I stay updated on data manipulation techniques in R?

Stay updated by following online resources, participating in R communities, and reading articles and tutorials on data manipulation techniques.


