Basic Operations using dplyr

Last Updated : 30 Apr, 2024

The dplyr package in the R programming language is a powerful toolkit for data manipulation. It simplifies common tasks like filtering, selecting, mutating, summarizing, and joining datasets. Here we walk through the core functions of dplyr and show how they can be used to manipulate and transform data effectively.

What is dplyr?

dplyr is an R package developed by Hadley Wickham, known for its simplicity and effectiveness in data manipulation tasks. It provides a set of functions that streamline common data manipulation operations, making data-wrangling tasks more efficient and less error-prone.

Features of dplyr

Simplifies data manipulation tasks with an intuitive syntax.
Optimized for speed and efficiency, making it ideal for large datasets.
Integrates seamlessly with other packages like ggplot2 and tidyr for comprehensive data analysis.

Step 1: Load the Required Libraries and Dataset

# Load the dplyr package
library(dplyr)

# Preview the first few rows of the mtcars dataset
head(mtcars)

Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

1. Filtering Rows

We can filter rows based on specific conditions using filter(). Suppose we want to keep only cars with more than 30 miles per gallon (mpg):

filtered_cars <- filter(mtcars, mpg > 30)
print(filtered_cars)

Output:

                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
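
filter() also accepts several conditions at once. The following is a minimal sketch on the same mtcars data; the variable names efficient_manuals and small_or_light are our own, chosen only for illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with AND:
# cars above 30 mpg that also have a manual transmission (am == 1)
efficient_manuals <- filter(mtcars, mpg > 30, am == 1)

# Use | for OR and %in% to match any value in a set:
# 4- or 6-cylinder cars that are either powerful or light
small_or_light <- filter(mtcars, cyl %in% c(4, 6), hp > 100 | wt < 2)

print(nrow(efficient_manuals))
print(nrow(small_or_light))
```

All four cars shown in the output above have am equal to 1, so the first filter keeps the same four rows.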

2. Selecting Columns

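To select specific columns of interest, we use the select() function. Here we select only miles per gallon (mpg). mtcars stores the car model as row names rather than as a column, so the model names still appear next to mpg in the output.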

# Selecting columns
selected_cars <- select(mtcars, mpg)
print(selected_cars)

Output (first 13 rows shown):

                     mpg
Mazda RX4           21.0
Mazda RX4 Wag       21.0
Datsun 710          22.8
Hornet 4 Drive      21.4
Hornet Sportabout   18.7
Valiant             18.1
Duster 360          14.3
Merc 240D           24.4
Merc 230            22.8
Merc 280            19.2
Merc 280C           17.8
Merc 450SE          16.4
Merc 450SL          17.3
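
Because mtcars keeps the car model in its row names, select() cannot pick it by column name. One way around this, sketched below, is to turn the row names into a regular column first. The sketch assumes the tibble package is installed (it supplies rownames_to_column()); the column name model is our own choice.

```r
library(dplyr)
library(tibble)  # provides rownames_to_column()

# Move the row names into an ordinary column called "model"
cars_with_model <- rownames_to_column(mtcars, var = "model")

# Now both columns can be selected by name
model_mpg <- select(cars_with_model, model, mpg)
print(head(model_mpg, 3))
```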

3. Mutating Data

With mutate(), we can create new columns based on existing ones. Now create a new column called hp_per_cyl representing the horsepower per cylinder.

mutated_cars <- mutate(mtcars, hp_per_cyl = hp / cyl)
print(mutated_cars)

Output (first 7 rows shown):

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb hp_per_cyl
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   18.33333
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   18.33333
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   23.25000
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   18.33333
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   21.87500
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   17.50000
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   30.62500
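
mutate() can create several columns in one call, and a later expression may use a column defined earlier in the same call. A small sketch, with wt_lbs and hp_per_lb as illustrative names of our own (in mtcars, wt is recorded in thousands of pounds):

```r
library(dplyr)

power_stats <- mutate(
  mtcars,
  wt_lbs = wt * 1000,        # convert weight from 1000 lbs to lbs
  hp_per_lb = hp / wt_lbs    # uses wt_lbs created on the line above
)

print(head(select(power_stats, hp, wt, wt_lbs, hp_per_lb), 3))
```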

4. Arranging Rows

To sort rows based on a specific column, we use arrange(). Let’s arrange the cars by descending order of horsepower (hp).

arranged_cars <- arrange(mtcars, desc(hp))
print(arranged_cars)

Output (first 7 rows shown):

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
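
arrange() accepts more than one sort key; later keys break ties in earlier ones. As a brief sketch (sorted_cars is an illustrative name), the following sorts by cylinder count in ascending order and, within each cylinder group, by mpg in descending order:

```r
library(dplyr)

# Primary key: cyl ascending; secondary key: mpg descending
sorted_cars <- arrange(mtcars, cyl, desc(mpg))

print(head(select(sorted_cars, cyl, mpg), 3))
```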

5. Summarizing Data

summarize() is used to compute summary statistics. Calculate the mean and median of miles per gallon (mpg).

summary_stats <- summarize(mtcars, mean_mpg = mean(mpg), median_mpg = median(mpg))
print(summary_stats)

Output:

  mean_mpg median_mpg
1 20.09062       19.2
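
When the same statistic is needed for several columns, across() avoids repeating the function call. This is a minimal sketch assuming dplyr 1.0.0 or later, where across() is available; multi_means is an illustrative name.

```r
library(dplyr)

# Apply mean() to mpg, hp and wt in a single summarize() call
multi_means <- summarize(mtcars, across(c(mpg, hp, wt), mean))

print(multi_means)
```

The result has one row, with one column per selected variable.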

6. Grouping Data

When performing aggregate operations by groups, we use group_by(). Group the cars by number of cylinders (cyl) and compute the mean miles per gallon (mpg) within each group.

grouped_cars <- group_by(mtcars, cyl)
summary_stats_by_cyl <- summarize(grouped_cars, mean_mpg = mean(mpg))
print(summary_stats_by_cyl)

Output:

# A tibble: 3 × 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4     26.7
2     6     19.7
3     8     15.1
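
Grouping is not limited to one variable. The sketch below groups by cylinders and transmission type (am), counts the cars in each group with n(), and uses .groups = "drop" so that the result is returned ungrouped. It assumes dplyr 1.0.0 or later for the .groups argument; the object names are illustrative.

```r
library(dplyr)

# Group by two variables at once
grouped_cyl_am <- group_by(mtcars, cyl, am)

# One row per (cyl, am) combination, returned without grouping
by_cyl_am <- summarize(
  grouped_cyl_am,
  n_cars = n(),
  mean_mpg = mean(mpg),
  .groups = "drop"
)

print(by_cyl_am)
```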

7. Chaining (%>% operator)

Suppose we perform multiple data manipulation tasks in sequence, like filtering, then arranging, then summarizing. Instead of storing the intermediate results in variables, we can “chain” the functions together using the %>% operator. It passes the result of one function as the first argument to the next function.

mtcars %>%
  filter(mpg > 20) %>%
  arrange(desc(hp)) %>%
  summarize(mean_mpg = mean(mpg))

Output:

  mean_mpg
1 25.47857

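The code above first filters cars with mpg greater than 20, then arranges them by descending horsepower, and finally computes the mean mpg, all in one go. The arrange() step does not change the result here, because the mean does not depend on row order; it is included only to show how steps chain together.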
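
The same kind of chain can also be written with R's native pipe, |>, which is part of base R from version 4.1.0 onward. The sketch below assumes such an R version and compares both pipes on a shortened chain; magrittr_result and native_result are illustrative names.

```r
library(dplyr)

# Chain written with the magrittr pipe that dplyr re-exports
magrittr_result <- mtcars %>%
  filter(mpg > 20) %>%
  summarize(mean_mpg = mean(mpg))

# The same chain written with the native pipe (R >= 4.1.0)
native_result <- mtcars |>
  filter(mpg > 20) |>
  summarize(mean_mpg = mean(mpg))

print(all.equal(magrittr_result, native_result))
```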

Joining with dplyr

Joins are used to combine data from multiple tables based on common columns. dplyr provides several functions for performing different types of joins, including inner joins, left joins, right joins, and full joins.

Inner Join (inner_join()): Returns only the rows that have matching values in both tables.
Left Join (left_join()): Returns all rows from the left table and matching rows from the right table. If there are no matching rows in the right table, NA values are filled in.
Right Join (right_join()): Returns all rows from the right table and matching rows from the left table. If there are no matching rows in the left table, NA values are filled in.
Full Join (full_join()): Returns all rows from both tables, combining them where there are matching values and filling in NA values where there are no matches.

library(dplyr)

# Create two datasets
df1 <- data.frame(ID = c(1, 2, 3),
                  Name = c("Alice", "Bob", "Charlie"))

df2 <- data.frame(ID = c(2, 3, 4),
                  Age = c(25, 30, 35))

# Inner join
inner_result <- inner_join(df1, df2, by = "ID")

# Left join
left_result <- left_join(df1, df2, by = "ID")

# Right join
right_result <- right_join(df1, df2, by = "ID")

# Full join
full_result <- full_join(df1, df2, by = "ID")

# Print all results with headers
print("Inner Join Result:")
print(inner_result)

print("Left Join Result:")
print(left_result)

print("Right Join Result:")
print(right_result)

print("Full Join Result:")
print(full_result)

Output:

[1] "Inner Join Result:"
  ID    Name Age
1  2     Bob  25
2  3 Charlie  30

[1] "Left Join Result:"
  ID    Name Age
1  1   Alice  NA
2  2     Bob  25
3  3 Charlie  30

[1] "Right Join Result:"
  ID    Name Age
1  2     Bob  25
2  3 Charlie  30
3  4    <NA>  35

[1] "Full Join Result:"
  ID    Name Age
1  1   Alice  NA
2  2     Bob  25
3  3 Charlie  30
4  4    <NA>  35
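
In practice, the key column often has a different name in each table. A named vector passed to by tells the join which columns to match. The sketch below uses hypothetical tables people and ages, with a made-up key name PersonID, purely for illustration.

```r
library(dplyr)

people <- data.frame(ID = c(1, 2, 3),
                     Name = c("Alice", "Bob", "Charlie"))

# Same key values as before, but stored under a different column name
ages <- data.frame(PersonID = c(2, 3, 4),
                   Age = c(25, 30, 35))

# Match people$ID against ages$PersonID
joined <- left_join(people, ages, by = c("ID" = "PersonID"))
print(joined)
```

The result keeps the key under the left table's name (ID), and Alice's Age is NA because ID 1 has no match in ages.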

Advantages of dplyr package

Readable Syntax: dplyr employs a straightforward grammar, making code more readable and easier to understand, even for those new to R.
Efficiency: dplyr is designed with performance in mind, and its operations are often faster than equivalent hand-written base R code, especially on large datasets.
Integration with Other Packages: dplyr seamlessly integrates with other popular R packages, such as ggplot2 for visualization and tidyr for data tidying, forming a powerful ecosystem for data analysis.
Consistency: The consistent naming conventions and function behaviors across dplyr functions simplify the learning curve and enhance code maintainability.

Conclusion

The dplyr package is like a handy toolkit for data manipulation in R. With its easy-to-understand functions, we can make our data analysis smoother, faster, and more insightful. By trying out different functions with sample datasets, we’ll become experts at using dplyr to handle all sorts of data tasks.

FAQs

What is data manipulation?

Data manipulation involves organizing, cleaning, and transforming data to make it useful for analysis.

Why do we need to manipulate data?

Manipulating data helps us prepare it for analysis by removing errors, handling missing values, and restructuring it to extract meaningful insights.

How do I optimize code for data manipulation in R?

Optimize your code by using efficient functions, minimizing memory usage, and breaking down complex tasks into smaller steps.

Can I handle large datasets in R?

Yes, you can handle large datasets in R using packages like `data.table` or `disk.frame` that are optimized for performance and memory usage.

How do I stay updated on data manipulation techniques in R?

Stay updated by following online resources, participating in R communities, and reading articles and tutorials on data manipulation techniques.
