Data Wrangling in R Programming – Data Transformation

A dataset can be presented in many different ways. Let us look at one of the most fundamental distinctions: whether a dataset is wide or long.

The difference between wide and long datasets comes down to whether we prefer more columns or more rows. A dataset that records additional data about a single subject in columns is known as a wide dataset, because it becomes wider as we add more columns. Similarly, a dataset that records data about a subject in rows is called a long dataset.

In data wrangling in R, we sometimes need to make long datasets wider and vice versa. In general, data scientists who embrace the concept of tidy data prefer long datasets over wide ones, because long datasets are usually easier to manipulate in R.

[Figure: wide vs long]

In the above figure, the same dataset is represented in both wide and long form. The dataset records religions together with their income classification. Now that we know what long and wide datasets are, let us use tools in R to convert wide datasets to long and long datasets to wide.
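To make the distinction concrete, here is a minimal sketch in base R that stores the same six counts in both shapes. The religion labels, the income bracket names (low, mid, high), and the counts are all made up for illustration; they are not taken from the dataset in the figure.

```r
# Hypothetical toy data: labels and counts are invented for illustration.

# Wide form: one row per religion, one column per income bracket.
wide <- data.frame(
  religion = c("Agnostic", "Atheist"),
  low  = c(10, 20),
  mid  = c(30, 40),
  high = c(50, 60)
)

# Long form: the same six counts, one per row.
long <- data.frame(
  religion = rep(c("Agnostic", "Atheist"), times = 3),
  income   = rep(c("low", "mid", "high"), each = 2),
  freq     = c(10, 20, 30, 40, 50, 60)
)

dim(wide)  # rows, columns of the wide form
dim(long)  # rows, columns of the long form
```

The wide form has fewer rows and more columns; the long form has one row per religion and income combination.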

Conversion of Wide datasets to long

The gather() function in the 'tidyr' package makes wide datasets long. It works on the concept of keys and values: each value is an observation of a single variable, while the key is the name that identifies the variable described by that value.

[Figure: key and value]

In the dataset above, income acts as the key, since it names the income bracket for each religion, and frequency provides the value for each income key.

Syntax:
gather(data, key, value, columns)

Parameters:
data: The data frame or tibble to reshape.
key: The name that we would like to use for the key column in the long dataset.
value: The name that we would like to use for the value column in the long dataset.
columns: The columns from the wide dataset that we would like to include in or exclude from the gathering.

Alternatively, if you want to gather most of the columns, you can specify the columns that you don’t want to collect by listing them with a minus sign (-) in front of them.
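As a rough sketch of the column selection options, the example below gathers a small inline data frame three ways: by naming the columns, by giving a range, and by excluding a column with a minus sign. The data frame and its column names (low, mid, high) are hypothetical and exist only for this illustration.

```r
library(tidyr)

# Hypothetical toy data, invented for illustration.
wide <- data.frame(
  religion = c("Agnostic", "Atheist"),
  low  = c(10, 20),
  mid  = c(30, 40),
  high = c(50, 60)
)

# 1. Name every column to gather.
by_name <- gather(wide, income, freq, low, mid, high)

# 2. Give a range of adjacent columns.
by_range <- gather(wide, income, freq, low:high)

# 3. Exclude the columns that should stay as they are.
by_minus <- gather(wide, income, freq, -religion)

# All three selections should describe the same set of columns.
all.equal(by_name, by_range)
all.equal(by_name, by_minus)
```

The minus form tends to be the most convenient when only one or two identifier columns should be left alone.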




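# Making Wide Datasets Long with gather()

# Load the tidyverse (includes tidyr, which provides gather())
library(tidyverse)

# Read in the dataset
sample_data <- read.csv("C:/Users/Admin/Desktop/pew.csv")

# Inspect the wide dataset
sample_data

# Gather every column except religion into income/freq pairs
sample_data_long <- gather(sample_data, income, freq, -religion)

# Inspect the long dataset
sample_data_long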


Output:

[Figures: sample_data before gather() and sample_data_long after gather()]
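The tidyr documentation describes gather() as superseded and recommends pivot_longer() for new code. The sketch below, which assumes tidyr 1.0.0 or later and uses a hypothetical inline data frame, shows what is intended to be the equivalent call. Note that the two functions order the resulting rows differently, so the comparison is on content rather than row order.

```r
library(tidyr)

# Hypothetical toy data, invented for illustration.
wide <- data.frame(
  religion = c("Agnostic", "Atheist"),
  low  = c(10, 20),
  mid  = c(30, 40),
  high = c(50, 60)
)

# gather(): stacks the gathered columns one after another.
long_gather <- gather(wide, income, freq, -religion)

# pivot_longer(): names_to plays the role of key, values_to the role of value.
long_pivot <- pivot_longer(
  wide,
  cols      = -religion,
  names_to  = "income",
  values_to = "freq"
)

long_gather
long_pivot
```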

Conversion of Long datasets to wide

Sometimes we need to perform the reverse of the gather() operation. The spread() function converts long datasets to wide ones.

Syntax:
spread(data, key, value)

Parameters:
data: The data frame or tibble to reshape.
key: The column in the long dataset whose values become the names of the new columns in the wide dataset.
value: The column in the long dataset whose values fill the cells of those new columns.




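# Making Long Datasets Wide with spread()

# Load the tidyverse (includes tidyr, which provides spread())
library(tidyverse)

# Read in the dataset
sample_data <- read.csv("C:/Users/Admin/Desktop/mexicanweather.csv")
sample_data

# Turn each distinct value of element into its own column, filled from value
sample_data_wide <- spread(sample_data, element, value)

sample_data_wide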


Output:

[Figures: sample_data before spread() and sample_data_wide after spread()]

The resulting dataset has half as many rows as the original Mexican weather dataset. It now has Tmax and Tmin columns and no longer has the element or value columns.
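The same reshaping can be tried without the CSV file. The sketch below uses a small hypothetical data frame shaped like the weather data (the station, date, and temperature values are invented, and the element labels TMAX and TMIN are assumed) and also shows pivot_wider(), which the tidyr documentation recommends over spread() for new code, assuming tidyr 1.0.0 or later.

```r
library(tidyr)

# Hypothetical toy data, invented for illustration.
weather_long <- data.frame(
  station = c("S1", "S1", "S1", "S1"),
  date    = c("2010-01-01", "2010-01-01", "2010-01-02", "2010-01-02"),
  element = c("TMAX", "TMIN", "TMAX", "TMIN"),
  value   = c(27.8, 14.5, 29.7, 13.4)
)

# spread(): element supplies the new column names, value fills them.
weather_wide <- spread(weather_long, element, value)

# pivot_wider(): names_from plays the role of key, values_from of value.
weather_wide2 <- pivot_wider(
  weather_long,
  names_from  = element,
  values_from = value
)

nrow(weather_long)   # rows before reshaping
nrow(weather_wide)   # rows after reshaping
names(weather_wide)  # element and value are replaced by TMAX and TMIN
```

Because each station and date pair has exactly two elements here, the wide form has half as many rows as the long form.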



Last Updated : 22 Jun, 2020