
Apply a Function (or functions) across Multiple Columns using dplyr in R


Data processing and manipulation are core tasks in data science and machine learning. R is one of the most widely used programming languages for data science, and dplyr is one of its most popular packages for data manipulation. In this article, we will learn how to apply a function (or functions) across multiple columns in R using the dplyr package.

What is dplyr?

dplyr is a powerful and efficient data manipulation package in R. It provides a set of functions for filtering, grouping, and transforming data. The functions in dplyr are designed to be simple and intuitive, making it easy to perform complex data manipulations with a few lines of code.

Prerequisites

Before we start, make sure that the dplyr package is installed on your system. If it is not, install it by running the following code:

install.packages("dplyr")

Once you have dplyr installed, you can load it into your R environment by running the following code:

library(dplyr)

Applying a Function to a Single Column

Let’s start by applying a function to a single column. For this, we will use the built-in mtcars data set, which contains information about various car models, including their miles per gallon (mpg) ratings. The code below loads the data set. Let’s say we want to calculate the logarithm of the mpg column. We can do this using the mutate function from the dplyr package.

R




data("mtcars")
  
mtcars_log_mpg <- mtcars %>% 
 mutate(log_mpg = log(mpg))


The mutate function takes the data frame mtcars as input and adds a new column log_mpg with the logarithm of the mpg column. The %>% operator is the pipe operator, which passes the output of the previous operation as the first argument to the next operation.
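To make the role of the pipe concrete, the short sketch below compares the piped call with the equivalent direct call. The names piped and direct are introduced here only for illustration.

```r
library(dplyr)

# Piped form: mtcars is passed as the first argument of mutate()
piped <- mtcars %>% mutate(log_mpg = log(mpg))

# Direct form: the same call written without the pipe
direct <- mutate(mtcars, log_mpg = log(mpg))

# Both forms should produce the same data frame
identical(piped, direct)
```

The piped form becomes more useful as more steps are chained, because each step reads top to bottom instead of being nested inside the previous call.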

Let’s visualize the changes brought by this transformation using a bar plot:

R




par(mfrow=c(1,2))
barplot(mtcars$mpg, main="Original mpg")
barplot(mtcars_log_mpg$log_mpg, main="log(mpg)")


Output:

Figure: bar charts of the original mpg values (left) and log(mpg) (right)

This bar plot shows the original mpg column and its logarithm side by side, which helps us understand the changes brought by the logarithm function.

As we can see, the logarithm function reduces the range of values, which can be useful in some cases where the original values have a large range. In this case, the logarithm function brings the values of the mpg column closer to each other, which can make it easier to see patterns and relationships in the data.
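The compression of the range can also be checked numerically rather than read off the plot. This is a minimal sketch; the variable names mpg_spread and log_spread are illustrative.

```r
# Spread (max minus min) of mpg on the original scale
mpg_spread <- diff(range(mtcars$mpg))

# Spread of the same column after taking the natural logarithm
log_spread <- diff(range(log(mtcars$mpg)))

# The log-transformed values cover a much narrower interval
c(original = mpg_spread, log = log_spread)
```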

Applying a Function to Multiple Columns

In the previous section, we learned how to apply a function to a single column. To apply the same function to every column of a data frame, we can use the mutate_all function from the dplyr package. Since dplyr 1.0.0, mutate_all and the other scoped verbs are superseded by across(), which is covered later in this article, but they still work. Let’s say we have a data frame df with three columns, and we want to apply the logarithm function to all of them.

The mutate_all function applies the logarithm function to all columns in the data frame and returns a new data frame with the same number of columns, each replaced by its logarithm. To visually represent the changes, we can plot the original data and the transformed data side by side:

R




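set.seed(42)  # fix the random seed so the example is reproducible
df <- data.frame(col1 = runif(10),
                 col2 = runif(10),
                 col3 = runif(10))
df_log <- df %>% mutate_all(~ log(.))
  
# One row per column: original on the left, log-transformed on the right
par(mfrow = c(3, 2))
for (i in seq_len(ncol(df))) {
  barplot(df[, i], main = colnames(df)[i])
  barplot(df_log[, i],
          main = paste("log(", colnames(df)[i], ")"))
}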


Output:

Figure: bar plots of each column before (left) and after (right) the log transformation

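In this example, the original data is plotted in the first column of each row, and the transformed data is plotted in the second column of each row. Because runif() draws values between 0 and 1, all of the logarithms are negative, so the transformed bars point downward. The plots show how the logarithm function changes the distribution of each column.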
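Because mutate_all() is one of dplyr's superseded scoped verbs, the same transformation can also be written with mutate() and across(). The sketch below checks that both spellings agree; the seed value and the names old_style and new_style are assumptions made for illustration.

```r
library(dplyr)

set.seed(1)  # assumed seed, only so the comparison is reproducible
df <- data.frame(col1 = runif(10),
                 col2 = runif(10),
                 col3 = runif(10))

# Scoped verb (superseded, but still available)
old_style <- df %>% mutate_all(~ log(.))

# across() form: everything() selects all columns
new_style <- df %>% mutate(across(everything(), log))

# The two results should be equal
all.equal(old_style, new_style)
```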

Applying Different Functions to Different Columns

Sometimes, we may want to apply different functions to different columns. For this, we can use the mutate_at function from the dplyr package, which, like mutate_all, is superseded by across() but still works. After the data frame itself, mutate_at takes two main arguments: the columns to transform (a character vector of names, a numeric vector of positions, or a vars() selection) and the function to apply, given here as a formula.

Let’s say we want to apply the logarithm function to the first and third columns, and the square root function to the second column. Here, mutate_at is called twice, once to apply the logarithm to columns 1 and 3, and once to apply the square root to column 2. We can then plot each original column next to its transformed version:

R




df_log_sqrt <- df %>% 
  mutate_at(c(1, 3), ~ log(.)) %>%   # logarithm of columns 1 and 3
  mutate_at(2, ~ sqrt(.))            # square root of column 2
  
# One row per column: original on the left, transformed on the right
par(mfrow = c(3, 2))
for (i in seq_len(ncol(df))) {
  barplot(df[, i], main = colnames(df)[i])
  barplot(df_log_sqrt[, i],
          main = ifelse(i %in% c(1, 3),
                        paste("log(", colnames(df)[i], ")"),
                        paste("sqrt(", colnames(df)[i], ")")))
}


Output:

Figure: bar plots of each column before (left) and after (right) the log and square root transformations

As we can see, the first and third columns are transformed by the logarithm function, while the second column is transformed by the square root function.
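The same result can be sketched with across(), using one across() call per function inside a single mutate(). The seed and the names df_across and df_scoped are assumptions for illustration; columns are selected by name here instead of by position.

```r
library(dplyr)

set.seed(1)  # assumed seed, only so the comparison is reproducible
df <- data.frame(col1 = runif(10),
                 col2 = runif(10),
                 col3 = runif(10))

# One across() call per function, all inside a single mutate()
df_across <- df %>%
  mutate(
    across(c(col1, col3), log),
    across(col2, sqrt)
  )

# Scoped-verb version from the section above, for comparison
df_scoped <- df %>%
  mutate_at(c(1, 3), ~ log(.)) %>%
  mutate_at(2, ~ sqrt(.))

all.equal(df_across, df_scoped)
```

Selecting by name tends to be more robust than selecting by position, since the code keeps working if the column order changes.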

Using the everything() and across() Functions

Let’s use the iris dataset as an example, and suppose we want to round all the numerical columns to the nearest integer. We can do this as follows:

R




library(dplyr)
  
# Select only numeric columns
iris_num <- iris %>%
  select(where(is.numeric))
  
# Apply the round function to each numeric column
iris_rounded <- iris_num %>%
  mutate(across(everything(), ~ round(., 0)))


In the above code, the mutate() function is used to apply the round() function to every column using across(). The everything() function is used as the argument to across() to apply the function to all columns. The second argument to round() is 0, which will round each data point to the nearest integer.

The resulting dataset iris_rounded will contain the same columns as iris_num, but with each data point rounded to the nearest integer.
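As a variation, the numeric columns can be selected inside across() with where(is.numeric), which keeps the non-numeric Species column in the result. across() also accepts a named list of functions, which is one way to apply several functions at once. The names iris_rounded_all and iris_summary below are illustrative.

```r
library(dplyr)

# Round only the numeric columns; Species is left untouched
iris_rounded_all <- iris %>%
  mutate(across(where(is.numeric), ~ round(.x, 0)))

# Apply two functions to every numeric column.
# .names controls the output names: "{.col}" is the column, "{.fn}" the function
iris_summary <- iris %>%
  summarise(across(where(is.numeric),
                   list(mean = mean, sd = sd),
                   .names = "{.col}_{.fn}"))

names(iris_summary)
```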

Using the c_across() Function

c_across() is a dplyr function designed for use with rowwise(). Inside a row-wise mutate() or summarise(), it selects columns with tidy-select syntax and combines their values for the current row into a single vector, which you can then pass to a function such as sum() or mean(). It does not apply the function itself; it only gathers the values.

R




library(dplyr)
  
# create sample data frame
df <- tibble(id = 1:3, a = c(1, 2, 3),
             b = c(4, 5, 6), c = c(7, 8, 9))
  
# use rowwise() and c_across() 
# to get sum of selected columns
df %>% 
  rowwise() %>% 
  mutate(
    sum_cols = sum(c_across(c(a, c)))
  )


Output:

# A tibble: 3 × 5
# Rowwise: 
    id     a     b     c sum_cols
 <int> <dbl> <dbl> <dbl>    <dbl>
1     1     1     4     7        8
2     2     2     5     8       10
3     3     3     6     9       12

In the above example, c_across() is used to select columns ‘a’ and ‘c’, and rowwise() is used to perform row-wise operations on the selected columns. The mutate() function is used to create a new column named sum_cols, which contains the sum of values in columns ‘a’ and ‘c’.
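c_across() also accepts ranges of columns, and the combined vector can be passed to any function that summarises a vector. The sketch below computes a row mean and a row maximum over columns a through c; the names row_stats, row_mean, and row_max are illustrative.

```r
library(dplyr)

df <- tibble(id = 1:3, a = c(1, 2, 3),
             b = c(4, 5, 6), c = c(7, 8, 9))

row_stats <- df %>%
  rowwise() %>%
  mutate(
    row_mean = mean(c_across(a:c)),  # mean of a, b, c within each row
    row_max  = max(c_across(a:c))    # largest of a, b, c within each row
  ) %>%
  ungroup()                          # drop the row-wise grouping afterwards

row_stats
```

Calling ungroup() at the end returns an ordinary tibble, so that later operations work on whole columns again instead of row by row.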

Using starts_with(), ends_with()

starts_with() is a selection helper that matches the columns whose names start with a given prefix. It can be used wherever dplyr accepts a tidy selection, such as select() and across().

R




# Example dataset
df <- tibble(a_col = 1:3,
             b_col = 4:6, c_col = 7:9)
  
# Select columns that start with 'a'
df_starts_with_a <- df %>%
 select(starts_with('a'))
  
print(df_starts_with_a)


Output:

# A tibble: 3 × 1
 a_col
 <int>
1     1
2     2
3     3

ends_with() works the same way, matching the columns whose names end with a given suffix.

R




# Example dataset
df <- tibble(a_col_1 = 1:3,
             b_col_0 = 4:6, c_col_1 = 7:9)
  
# Select columns that end with '_1'
df_ends_with_1 <- df %>%
 select(ends_with('_1'))
  
print(df_ends_with_1)


Output:

# A tibble: 3 × 2
 a_col_1 c_col_1
   <int>   <int>
1       1       7
2       2       8
3       3       9
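These helpers are not limited to select(). As a minimal sketch, the example below uses ends_with() inside across() to transform only the matching columns; the name df_doubled and the doubling operation are chosen purely for illustration.

```r
library(dplyr)

df <- tibble(a_col_1 = 1:3,
             b_col_0 = 4:6, c_col_1 = 7:9)

# Double only the columns whose names end with '_1'
df_doubled <- df %>%
  mutate(across(ends_with("_1"), ~ .x * 2))

df_doubled
```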

Using if_any() and if_all()

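if_any() and if_all() apply a condition to each selected column and return one logical value per row: if_any() is TRUE when at least one of the columns meets the condition, and if_all() is TRUE only when all of them do. They are typically used inside filter(). In the example below, if_all() keeps the rows where mpg, cyl, and disp are all greater than 7.9. Since every car has mpg and disp above 7.9, this amounts to keeping the 8-cylinder cars.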

R




library(dplyr)
  
# Filter rows where all the selected 
# columns have values greater than 7.9
mtcars %>% 
  filter(if_all(c("mpg", "cyl", "disp"),
                ~. > 7.9))


Output:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Example 2:

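The if_any() function filters rows where any of the selected columns have values greater than 400. Only disp can exceed 400 in this data set, so the result contains the cars with a displacement above 400.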

R




library(dplyr)
# Filter rows where any of the selected
# columns have values greater than 400
mtcars %>% 
  filter(if_any(c("mpg", "cyl", "disp"),
                ~. > 400))


Output:

                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
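One way to see what these helpers do is to compare them with the explicit base R conditions: if_all() behaves like combining the per-column tests with &, and if_any() like combining them with |. The sketch below assumes the same thresholds as the examples above; the variable names are illustrative.

```r
library(dplyr)

# dplyr versions, with columns given as bare names
all_rows <- mtcars %>% filter(if_all(c(mpg, cyl, disp), ~ .x > 7.9))
any_rows <- mtcars %>% filter(if_any(c(mpg, cyl, disp), ~ .x > 400))

# Base R versions: every condition joined with & (all) or | (any)
all_base <- mtcars[mtcars$mpg > 7.9 & mtcars$cyl > 7.9 & mtcars$disp > 7.9, ]
any_base <- mtcars[mtcars$mpg > 400 | mtcars$cyl > 400 | mtcars$disp > 400, ]

# Both approaches should keep the same number of rows
c(nrow(all_rows) == nrow(all_base), nrow(any_rows) == nrow(any_base))
```

The dplyr form scales better as the number of columns grows, since the condition is written once instead of once per column.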

Conclusion

In this article, we have explored several functions in the dplyr package that can be used to apply functions across multiple columns in R. The mutate function adds or transforms a single column, mutate_all applies the same function to all columns, and mutate_at applies a function to a chosen subset of columns, so chaining several mutate_at calls lets you apply different functions to different columns.

The across() function is a powerful addition to the dplyr package that applies a function to multiple columns chosen with selection helpers such as everything(), starts_with(), and ends_with(). The c_across() function combines the values of selected columns within each row, so that row-wise summaries can be computed after rowwise().

Furthermore, if_any() and if_all() apply a condition across several columns at once, which makes it easy to filter rows based on any or all of those columns.

In conclusion, the dplyr package provides a powerful set of tools for data manipulation in R. By using these functions, you can easily apply functions to multiple columns and perform complex data manipulations with ease.



Last Updated : 16 Mar, 2023