Open In App

Loading and Cleaning Data with R and the tidyverse

Improve
Improve
Like Article
Like
Save
Share
Report

The tidyverse is a collection of packages that work well together due to shared data representations and API design. The tidyverse package is intended to make it simple to install and load core tidyverse packages with a single command.

To install tidyverse, put the following code in RStudio:

R




# Install from CRAN
install.packages("tidyverse")
  
# to check your installation 
library(tidyverse)


Output:

── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

The tidyr package will be used for data cleaning, and the readr package will be used for data loading.

Data loading using readr

Dear Friends, In this tutorial, we will read and parse a CSV file using the readr package’s read CSV function. CSV (Comma-Separated Values) files contain data separated by commas. The following CSV file will be used in the following example. To begin, pass the path to the file to be read to the read_csv function. The read CSV function generates tibbles that can be attached to variables. 

R




# load the tidyverse by running this code:
library(tidyverse)
  
# create a tibble named rand
rand <- read_csv("Example.csv")


Output:

chr (2): ── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Code, Age_single_years
dbl (2): Census_night_population_count, Census_usually_resident_population_count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message

 

Inline CSV input is very useful, and these options can also help you in normal file parsing too.

R




# give inline csv input
read_csv("a,b,c
  1,2,3
  4,5,6")


Output:

      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

The first line of a CSV file is the name of the columns. However, there are other options for dealing with exceptions. 

R




read_csv("first line of metadata
  second line of metadata
  a,b,c
  1,2,3", skip = 2)


Output:

      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3

R




# when we need to ignore comments in csv file
read_csv("#ignore it is a comment
 #ignore this is another comment
 x,y,z
 1,2,3
 4,5,6", comment = "#")


Output:

      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

If the first line is not the name of the columns, then we can do this

R




# If you do not set column names then R does it for you.
# The false flag tells the computer that the 
# first line is not column names.
read_csv("1,2,3\n4,5,6", col_names=FALSE)


Output:

     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

R




# You can set custom column names
read_csv("1,2,3\n4,5,6", col_names = c("COLUMN1","COLUMN2","COLUMN3"))


Output:

  COLUMN1 COLUMN2 COLUMN3
    <dbl>   <dbl>   <dbl>
1       1       2       3
2       4       5       6

R




# you can use na to represent missing data
read_csv("a,b,c\n1,2,.", na = ".")


Output:

      a     b c    
  <dbl> <dbl> <lgl>
1     1     2 NA 
 

Cleaning Data with tidyverse (What even is tidy data?)

There are three rules of tidy data.

  • Every variable is a column.
  • Every observation is a row.
  • And every value is a cell.

 

First, see examples of tidy and untidy data.

R




# tidy data
table1


Output:

  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Examples of untidy data and how to deal with it.

pivot_wider()

In table2, a single observation is scattered across several rows, this can be fixed by using the pivot_wider() option

  •    The column to take variable names from. Here, it’s typed.
  •    The column to take values from. Here it counts.

R




table2


Output:

   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

R




table2 %>% pivot_wider(names_from=type,
                       values_from=count)


Output:

  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

separate()

In table three, we have to separate two values in a column.

R




table3


Output:

  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

R




table3 %>% separate(rate,
                    into = c("cases", "population"),
                    sep = "/")


Output:

  country      year cases  population
  <chr>       <int> <chr>  <chr>
1 Afghanistan  1999 745    19987071
2 Afghanistan  2000 2666   20595360
3 Brazil       1999 37737  172006362
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

“cases” and “population” are character columns, which is the default behavior in separate(). It leaves the type of the column as it is, we can convert to better types using “convert = TRUE”.

R




table3 %>% separate(rate,
                    into=c("cases", "population"),
                    convert=TRUE)


Output:

  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

You can also pass an integer to “sep” which will interpret the integers as positions to split at. Indexing from 1 from the left and -1 from the right.

R




table3 %>% separate(year,
                    into=c("century", "year"),
                    sep=2)


Output:

  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

pivot_longer()

When some column names are not names of variables, but values of a variable.

  • The set of columns whose names are values, not variables. In this example, those are the columns “1999” and “2000”.
  • The name of the variable to move the column names to here is “year”.
  • The name of the variable to move the column values to here it’s “cases”.

R




table4a


Output:

  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

R




table4a %>% pivot_longer(c(`1999`, `2000`),
                           names_to="year",
                           values_to="cases")


Output:

  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

unite()

Use “unite()” to rejoin the century and year columns that we created in the last example. “unite()” takes a tibble and the name of the new variable to create, and a column to combine.

R




table5 %>% unite(new, century, year)


Output:

   country     new   rate             
   <chr>       <chr> <chr>            
 1 Afghanistan 19_99 745/19987071     
 2 Afghanistan 20_00 2666/20595360    
 3 Brazil      19_99 37737/172006362  
 4 Brazil      20_00 80488/174504898  
 5 China       19_99 212258/1272915272
 6 China       20_00 213766/1280428583

We also need to use the sep argument, because by default R will place an underscore (_) between the values from different columns. Here we don’t want any separator, so we use “”,

R




table5 %>% unite(new, century, year, sep = "")


Output:

   country     new   rate             
   <chr>       <chr> <chr>            
 1 Afghanistan 1999  745/19987071     
 2 Afghanistan 2000  2666/20595360    
 3 Brazil      1999  37737/172006362  
 4 Brazil      2000  80488/174504898  
 5 China       1999  212258/1272915272
 6 China       2000  213766/1280428583

Missing values

A value can be missing in two ways,

  • Explicitly – NA.
  • Implicitly – not present in the data.

R




table98 <- tibble(
  country = c("Afghanistan", "Afghanistan", "Brazil", "China", "China"),  
  year   = c(1999, 2000, 1999, 1999, 2000),
  cases    = c(   745,    2666,    37737,    80488,    212258),
  population = c(19987071, 20595360, 172006362,   NA, 1280428583)
)


There are two missing values here,

  •    The population of 1999 “China” is explicitly missing because its cell has NA.
  •    The population of 2000 “Brazil” is explicitly missing because it doesn’t appear in the data.

We can make the implicit missing value explicit by putting years in the columns:

R




table98 %>% pivot_wider(names_from=year,
                        values_from=population)


Output:

  country      cases    `1999`     `2000`
  <chr>        <dbl>     <dbl>      <dbl>
1 Afghanistan    745  19987071         NA
2 Afghanistan   2666        NA   20595360
3 Brazil       37737 172006362         NA
4 China        80488        NA         NA
5 China       212258        NA 1280428583

You can set “values_drop_na = TRUE” in “pivot_longer()” to turn explicit missing values implicit,

R




table98 %>% 
  pivot_wider(names_from = year, values_from = population) %>% 
  pivot_longer(
    cols = c(`1999`, `2000`), 
    names_to = "year"
    values_to = "population"
    values_drop_na = TRUE
  )


Output:

  country      cases year  population
  <chr>        <dbl> <chr>      <dbl>
1 Afghanistan    745 1999    19987071
2 Afghanistan   2666 2000    20595360
3 Brazil       37737 1999   172006362
4 China       212258 2000  1280428583

complete() 

complete() takes a set of columns, and finds all unique combinations, filling in explicit NAs where necessary.

R




table98 %>% complete(year, cases)


Output:

    year  cases country     population
   <dbl>  <dbl> <chr>            <dbl>
 1  1999    745 Afghanistan   19987071
 2  1999   2666 NA                  NA
 3  1999  37737 Brazil       172006362
 4  1999  80488 China               NA
 5  1999 212258 NA                  NA
 6  2000    745 NA                  NA
 7  2000   2666 Afghanistan   20595360
 8  2000  37737 NA                  NA
 9  2000  80488 NA                  NA
10  2000 212258 China       1280428583

fill()

fill in those missing values with fill(). Replaces missing values with the most recent non-missing value (sometimes called the last observation carried forward).

R




treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Gautam",           1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "heema",            1,           4
)
  
treatment %>% fill(person)


Output:

  person treatment response
  <chr>      <dbl>    <dbl>
1 Gautam         1        7
2 Gautam         2       10
3 Gautam         3        9
4 heema          1        4


Last Updated : 07 Nov, 2022
Like Article
Save Article
Previous
Next
Share your thoughts in the comments
Similar Reads