Single-Table Analysis with dplyr using R Language

Last Updated : 11 Oct, 2022

The dplyr package is used to manipulate and transform data in R. It can be installed using the following command:

install.packages("dplyr")

Let’s create the main dataframe:

R

#loading the required library
library(dplyr)
 
#creating a data frame
data_frame = data.frame(companies = c("Geekster","GeeksforGeeks","Wipro","TCS",
                                      "GeeksforGeeks","GeeksforGeeks","TCS","Wipro",
                                      "Geekster","Wipro"),
                        people = c(100,NA,532,454,234,554,223,122,432,453),
                        rating = c(4,3,5,NA,5,3,NA,4,5,2))
 
print("Original Data frame")
 
print(data_frame)

Output :

Using Pull Method

The pull method in the dplyr package is used to extract a single column of the data frame as a vector. The values in the vector appear in the same order in which they occur in the data frame.

Syntax : pull(col-name)

Arguments: col-name: the column name to be extracted as a vector

In the following code snippet, the values belonging to the column companies are extracted as a vector.

R

print("Extracting companies vector from data frame")
 
print("Companies vector")
data_frame %>%
  pull(companies)

Output
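To make the difference between a vector and a data frame concrete, here is a minimal sketch. The small data frame df and the variable names are hypothetical and used only for illustration; it compares pull with select on the same column.

```r
library(dplyr)

# Hypothetical data frame, used only for this illustration
df <- data.frame(companies = c("Wipro", "TCS", "Geekster"),
                 people = c(532, 454, 100))

# pull() returns a plain vector, so vector functions work on it directly
people_vec <- df %>% pull(people)
print(is.vector(people_vec))
print(mean(people_vec))

# select() keeps the data frame structure, even for a single column
people_df <- df %>% select(people)
print(is.data.frame(people_df))
```

pull is usually the better fit when the next step expects a vector, while select is the better fit when the result should stay a data frame.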

Using Rename Method

The rename method in the dplyr package is used to change the name of a data frame column in R. It returns a new data frame with the renamed column; the original data frame is left unchanged unless the result is assigned back to it.

Syntax : rename(new-col-name = old-col-name)

Arguments

new-col-name: the new name for the column

old-col-name: the existing name of the column

In the following code snippet, the column rating is renamed to feedback_rating.

R

print("Renaming rating column")
data_frame %>%
    rename(feedback_rating = rating)

Output
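Since dplyr verbs return a new data frame rather than modifying their input, the renamed result has to be assigned to keep it. The sketch below shows this with a small hypothetical data frame df; the names df and renamed are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data frame, used only for this illustration
df <- data.frame(companies = c("Wipro", "TCS"),
                 rating = c(5, 4))

# rename() returns a new data frame; df itself is not changed
renamed <- df %>%
  rename(feedback_rating = rating)

print(names(df))       # column names of the original
print(names(renamed))  # column names of the renamed copy
```

To overwrite the original, assign the result back to the same name, for example df <- df %>% rename(feedback_rating = rating).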

Using Arrange Method

The arrange method in the dplyr package sorts the rows of a data frame based on the values in one or more columns. By default, the data is arranged in ascending order of the column given as the argument to the arrange method, and rows with NA in that column are placed last. In the following code snippet, the data frame rows are arranged according to the column rating, so the rows with the lowest rating are displayed first in the output.

Syntax : arrange(col-name-to-sort-the-data)

R

print("Arranging data frame by rating column")
data_frame %>%
  arrange(rating)

Output:
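As a sketch of two common variations, the example below sorts in descending order with desc() and sorts by more than one column. The data frame df is hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical data frame, used only for this illustration
df <- data.frame(companies = c("Wipro", "TCS", "Wipro", "TCS"),
                 rating = c(2, NA, 5, 4))

# Descending order by rating; rows with NA still go last
by_rating_desc <- df %>%
  arrange(desc(rating))
print(by_rating_desc)

# Sort by company first, then by descending rating within each company
by_company <- df %>%
  arrange(companies, desc(rating))
print(by_company)
```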

Using Filter Method

The filter method in the dplyr package in R is used to select a subset of rows of the original data frame based on whether the specified condition holds true. The condition may use any logical or comparative operator to filter the necessary values.

Syntax : filter(data , cond)

Arguments:

data- the data frame to be manipulated

cond- the condition to be checked to filter the values

In the following code snippet, the rows whose people value is NA are removed.

R

print("Removing rows with NA in people column")
data_frame %>%
  filter(!is.na(people))

Output
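Conditions can also be combined. The following is a minimal sketch, using a hypothetical data frame df, of passing several conditions to filter (they are combined with AND) and of matching against a set of values with %in%.

```r
library(dplyr)

# Hypothetical data frame, used only for this illustration
df <- data.frame(companies = c("Geekster", "Wipro", "TCS", "Wipro"),
                 people = c(100, 532, NA, 122),
                 rating = c(4, 5, 3, NA))

# Several conditions separated by commas must all hold.
# A row whose condition evaluates to NA is dropped as well.
high_rated <- df %>%
  filter(!is.na(people), rating >= 4)
print(high_rated)

# Keep rows whose company is one of the listed values
wipro_or_tcs <- df %>%
  filter(companies %in% c("Wipro", "TCS"))
print(wipro_or_tcs)
```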

Using Summarize Method

The summarize() method is used to return a single row of output. To do this, it reduces each entire column to a single value using a summary function such as max(), mean() or n().

Syntax : summarize(new-col-name = summary-function(column_name))

Arguments: column_name- the name of the column on which the summarization is to be done.

R

# summarize
data_frame %>%
  summarize(num_rows = n(),most_bellas = max(companies))

Output

  num_rows most_bellas
1       10       Wipro
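Summary functions such as mean() return NA when the column contains missing values, which matters for data like the people column here. The sketch below, on a small hypothetical data frame df, shows one way to handle this with the na.rm argument.

```r
library(dplyr)

# Hypothetical data frame, used only for this illustration
df <- data.frame(people = c(100, NA, 532),
                 rating = c(4, 3, 5))

stats <- df %>%
  summarize(num_rows = n(),                            # number of rows
            mean_people = mean(people, na.rm = TRUE),  # NA values ignored
            max_rating = max(rating))                  # largest rating
print(stats)
```

Without na.rm = TRUE, mean_people would be NA because one value of people is missing.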

Manipulating and Analyzing Data using dplyr

The following code snippets apply several single-table verbs from the dplyr package, one after another, to manipulate and analyze data. The first step removes the rows with NA values in the column people using the filter method. The pipe operator then passes this result to the mutate method, which adds a new column total_people computed by multiplying the rating and people values. The resultant data frame will now contain 4 columns.

This is followed by the selection of columns using the select method, wherein only the columns companies and total_people are kept. The group_by method is then applied to group this data frame by company. The data belonging to these groups can be analyzed statistically using the summarise method, which here creates a new column mean_rating holding the sum of total_people for each company divided by the number of rows for that company.

1. Creating a Dataframe

R

#creating a data frame
data_frame = data.frame(companies = c("Geekster","GeeksforGeeks","Wipro","TCS",
                                      "GeeksforGeeks","GeeksforGeeks","TCS","Wipro",
                                      "Geekster","Wipro"),
                        people = c(100,123,NA,454,234,554,223,122,432,453),
                        rating = c(4,3,5,2,2,3,1,4,5,3))
 
print("Original Data frame")
print(data_frame)

Output :

2. Filtering Data based on a Condition

R

#filter data based on condition
print("Application of multiple operations ")
data1 = data_frame %>%
  filter(!is.na(people)) 
print("Data after removal of people with NA value")
print(data1)

Output :

3. Computing new column using other columns

R

#computing the total_people column from rating and people
data2 = data1 %>%
  mutate(total_people = rating * people)
print("Data after computing the total_people column")
print(data2)

Output :
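mutate can add more than one column in a single call, and a later column can use one created earlier in the same call. Here is a minimal sketch on a hypothetical data frame df; the column share is an assumption added only for illustration.

```r
library(dplyr)

# Hypothetical data frame, used only for this illustration
df <- data.frame(people = c(100, 200),
                 rating = c(4, 5))

out <- df %>%
  mutate(total_people = rating * people,
         # share uses total_people, which was created on the line above
         share = total_people / sum(total_people))
print(out)
```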

4. Selecting a few columns from the data

R

#selecting only specific columns
data3 = data2 %>%
  select(companies,total_people)
print("Data after selecting the companies and total_people columns")
print(data3)

Output :
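Besides listing column names, select can also drop columns or pick them by a pattern. The sketch below uses a hypothetical data frame df to show a few of these forms.

```r
library(dplyr)

# Hypothetical data frame, used only for this illustration
df <- data.frame(companies = c("Wipro", "TCS"),
                 people = c(532, 454),
                 rating = c(5, 2),
                 total_people = c(2660, 908))

# Keep columns by name
by_name <- df %>% select(companies, total_people)

# Drop a column with a minus sign
without_rating <- df %>% select(-rating)

# Keep every column whose name contains "people"
people_cols <- df %>% select(contains("people"))

print(names(by_name))
print(names(without_rating))
print(names(people_cols))
```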

5. Grouping data using group_by()

R

#grouping the data based on company
data4 = data3%>%
  group_by(companies)%>%
  summarise(mean_rating = sum(total_people)/n())
print("Data grouped on companies and mean rating given")
print(data4)

Output :
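The five steps above can also be written as one chained pipeline, without the intermediate variables data1 to data4. This is a sketch under the assumption that the intermediate results are not needed elsewhere; the name result is hypothetical.

```r
library(dplyr)

# Same data frame as in step 1
data_frame <- data.frame(
  companies = c("Geekster", "GeeksforGeeks", "Wipro", "TCS",
                "GeeksforGeeks", "GeeksforGeeks", "TCS", "Wipro",
                "Geekster", "Wipro"),
  people = c(100, 123, NA, 454, 234, 554, 223, 122, 432, 453),
  rating = c(4, 3, 5, 2, 2, 3, 1, 4, 5, 3))

result <- data_frame %>%
  filter(!is.na(people)) %>%                     # step 2: drop rows with NA people
  mutate(total_people = rating * people) %>%     # step 3: add computed column
  select(companies, total_people) %>%            # step 4: keep two columns
  group_by(companies) %>%                        # step 5: one group per company
  summarise(mean_rating = sum(total_people) / n())

print(result)
```

Inside summarise, sum(total_people) / n() gives the same value as mean(total_people) here, because the NA rows were removed in the first step.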
