
Single-Table Analysis with dplyr using R Language

Last Updated : 11 Oct, 2022

The dplyr package is used to manipulate and transform data held in data frames. It can be installed with the following command:

install.packages("dplyr")

Let’s create the main dataframe:

R




# loading the required library
library(dplyr)
 
#creating a data frame
data_frame = data.frame(companies = c("Geekster","GeeksforGeeks","Wipro","TCS",
                                      "GeeksforGeeks","GeeksforGeeks","TCS","Wipro",
                                      "Geekster","Wipro"),
                        people = c(100,NA,532,454,234,554,223,122,432,453),
                        rating = c(4,3,5,NA,5,3,NA,4,5,2))
 
print("Original Data frame")
 
print(data_frame)


Output :

Output

 

Using Pull Method

The pull method in the dplyr package extracts a single column of a data frame as a vector. The values in the vector appear in the same order in which they occur in the data frame.

Syntax : pull(col-name)

Arguments: col-name: the column name to be extracted as a vector

In the following code snippet, the values belonging to the column companies are extracted as a vector.

R




print("Extracting companies vector from data frame")
 
print("Companies vector")
data_frame %>%
  pull(companies)


Output

pull function
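
As a small illustration of the difference between a vector and a data frame, the sketch below compares pull with select. It uses a shortened, hypothetical data frame named df (not the article's full data) to keep the output small.

```r
library(dplyr)

# Hypothetical three-row data frame, used only for this illustration
df <- data.frame(companies = c("Geekster", "Wipro", "TCS"),
                 people = c(100, 532, 454))

# pull() returns a plain vector
people_vec <- df %>% pull(people)
is.vector(people_vec)      # TRUE

# select() keeps the data frame structure, even for a single column
people_df <- df %>% select(people)
is.data.frame(people_df)   # TRUE

# pull() also accepts a column position; negative values count from the right
first_col <- df %>% pull(1)    # the companies column
last_col  <- df %>% pull(-1)   # the people column
```

Use pull when a later step expects a vector (for example mean() or sum()), and select when the result should stay a data frame.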

 

Using Rename Method

The rename method in the dplyr package is used to change the name of a data frame column in R. It returns a new data frame with the renamed column; the original data frame is left unchanged unless the result is assigned back to it.

Syntax : rename(new-col-name = old-col-name)

Arguments 

new-col-name: the new name for the column

old-col-name: the existing name of the column

In the following code snippet, the column rating is renamed by the name feedback_rating.

R




print("Renaming rating column")
data_frame %>%
    rename(feedback_rating = rating)


Output

rename function
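
The following sketch shows that rename returns a new data frame and leaves its input untouched. The two-row data frame df and the variable renamed are hypothetical names chosen for this example.

```r
library(dplyr)

# Hypothetical two-row data frame, used only for this illustration
df <- data.frame(companies = c("Geekster", "Wipro"),
                 rating = c(4, 5))

# rename() returns a modified copy; df itself is not changed
renamed <- df %>% rename(feedback_rating = rating)

names(df)        # "companies" "rating"
names(renamed)   # "companies" "feedback_rating"
```

To keep the new name, assign the result to a variable (or back to df) as done with renamed above.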

 

Using Arrange Method 

The arrange method in the dplyr package sorts the rows of a data frame by the values of one or more columns. By default, the rows are arranged in ascending order of the column passed as the argument, and rows with NA in that column are placed at the end. In the following code snippet, the data frame rows are arranged according to the column rating, so the rows with the minimum rating are displayed first in the output.

Syntax : arrange(col-name-to-sort-the-data)

R




print("Arranging data frame by rating column")
data_frame %>%
  arrange(rating)


Output:

arrange function
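
Two common variations are sorting in descending order with desc() and sorting by more than one column. The sketch below shows both on a small, hypothetical data frame df; note that rows with NA are placed last in either direction.

```r
library(dplyr)

# Hypothetical four-row data frame, used only for this illustration
df <- data.frame(companies = c("Wipro", "TCS", "Wipro", "TCS"),
                 rating = c(4, NA, 5, 3))

# Descending order: highest rating first, NA still at the end
by_rating_desc <- df %>% arrange(desc(rating))
by_rating_desc$rating            # 5 4 3 NA

# Sort by company first, then by descending rating within each company
by_company <- df %>% arrange(companies, desc(rating))
by_company$rating                # 3 NA 5 4
```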

 

Using Filter Method

The filter method in the dplyr package in R is used to select the subset of rows of the original data frame for which the specified condition holds true. The condition may use any logical or comparison operator.

Syntax : filter(data , cond) 

Arguments:

data- the data frame to be manipulated

cond-  the condition to be checked to filter the values

In the following code snippet, we remove the rows in which the people column is NA.

R




print("Filtering out rows where people is NA")
data_frame %>%
  filter(!is.na(people))


Output

filter function
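
filter can also combine several conditions. The sketch below, on a small hypothetical data frame df, shows comma-separated conditions (combined with AND), the | operator (OR), and how rows whose condition evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical four-row data frame, used only for this illustration
df <- data.frame(companies = c("Geekster", "GeeksforGeeks", "Wipro", "TCS"),
                 people = c(100, NA, 532, 454),
                 rating = c(4, 3, 5, NA))

# Comma-separated conditions must all hold (logical AND)
both <- df %>% filter(people > 200, rating >= 4)
both$companies      # "Wipro"

# Rows where the condition is NA are dropped, so TCS (rating NA) is excluded
high_rating <- df %>% filter(rating >= 4)
high_rating$companies   # "Geekster" "Wipro"

# Logical OR: rows with a missing value in either column
any_missing <- df %>% filter(is.na(people) | is.na(rating))
any_missing$companies   # "GeeksforGeeks" "TCS"
```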

 

Using Summarize Method 

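The summarize() method reduces a data frame to a single row of output (or to one row per group when the data is grouped). To do this, it collapses each entire column into a single value using an aggregate function such as max() or mean().

Syntax : summarize(new_col_name = fn(column_name))

Arguments: new_col_name - the name of the summary column to create; fn - an aggregate function such as max(); column_name - the name of the column to be summarized. In the following code snippet, n() counts the rows and max(companies) returns the alphabetically last company name.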

R




# summarize
data_frame %>%
  summarize(num_rows = n(),most_bellas = max(companies))


Output

wipro
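
When a column contains NA, most aggregate functions return NA unless told to skip missing values. The sketch below, on a hypothetical three-row data frame df, shows this with mean() and its na.rm argument alongside other summaries.

```r
library(dplyr)

# Hypothetical three-row data frame, used only for this illustration
df <- data.frame(people = c(100, NA, 532),
                 rating = c(4, 3, 5))

# Without na.rm, a single NA makes the whole summary NA
with_na <- df %>% summarize(mean_people = mean(people))
with_na$mean_people     # NA

# na.rm = TRUE skips missing values; several summaries can be computed at once
stats <- df %>%
  summarize(mean_people = mean(people, na.rm = TRUE),  # (100 + 532) / 2 = 316
            max_rating = max(rating),                  # 5
            n_rows = n())                              # 3
print(stats)
```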

Manipulating and Analyzing Data using dplyr

The following code snippets apply several single-table verbs from the dplyr package to manipulate and analyze data. The first step removes the rows with NA values in the people column using the filter method. The pipe operator then passes this result to the mutate method, which adds a new column computed from existing columns: total_people is the product of the rating and people values in each row. The resulting data frame now contains 4 columns.

Next, the select method keeps only the companies and total_people columns. The group_by method then groups this data frame by company, and the data in each group is analyzed statistically using the summarise method. The summarise method creates a new column mean_rating, computed as the sum of total_people within each company divided by the number of rows for that company (that is, the mean of total_people per company).

1.  Creating a Dataframe

R




#creating a data frame
data_frame = data.frame(companies = c("Geekster","GeeksforGeeks","Wipro","TCS",
                                      "GeeksforGeeks","GeeksforGeeks","TCS","Wipro",
                                      "Geekster","Wipro"),
                        people = c(100,123,NA,454,234,554,223,122,432,453),
                        rating = c(4,3,5,2,2,3,1,4,5,3))
 
print("Original Data frame")
print(data_frame)


Output :

Dataset

 

2. Filtering Data based on a Condition

R




#filter data based on condition
print("Application of multiple operations ")
data1 = data_frame %>%
  filter(!is.na(people))
print("Data after removal of people with NA value")
print(data1)


Output :

Applications of multiple operations

 

3. Computing new column using other columns

R




# computing the total_people column as rating * people
data2 = data1 %>%
  mutate(total_people = rating * people)
print("Data after computing the total_people column")
print(data2)


Output :

Computing total people

 

4. Select few columns from the data.

R




#selecting only specific columns
data3 = data2%>%
  select(companies,total_people)
print("Data after selecting the companies and total_people columns")
print(data3)


Output :

Selecting columns

 

5. Grouping data using group_by()

R




#grouping the data based on company
data4 = data3%>%
  group_by(companies)%>%
  summarise(mean_rating = sum(total_people)/n())
print("Data grouped on companies and mean rating given")
print(data4)


Output :

data group_by
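
The five steps above can also be written as one chained pipeline, without the intermediate variables data1 to data4. The sketch below reuses the data frame from step 1; the variable name result is an assumption made for this example.

```r
library(dplyr)

# Same data frame as in step 1
data_frame <- data.frame(
  companies = c("Geekster", "GeeksforGeeks", "Wipro", "TCS",
                "GeeksforGeeks", "GeeksforGeeks", "TCS", "Wipro",
                "Geekster", "Wipro"),
  people = c(100, 123, NA, 454, 234, 554, 223, 122, 432, 453),
  rating = c(4, 3, 5, 2, 2, 3, 1, 4, 5, 3))

# filter -> mutate -> select -> group_by -> summarise in a single chain
result <- data_frame %>%
  filter(!is.na(people)) %>%                    # drop the row with missing people
  mutate(total_people = rating * people) %>%    # add the derived column
  select(companies, total_people) %>%           # keep only the needed columns
  group_by(companies) %>%                       # one group per company
  summarise(mean_rating = sum(total_people) / n())

print(result)
# GeeksforGeeks 833, Geekster 1280, TCS 565.5, Wipro 923.5
```

The chained form gives the same result as the step-by-step version; the intermediate variables are mainly useful when each stage needs to be inspected or reused.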

 


