Data Munging in R Programming

Data Munging is the general technique of transforming data from unusable or erroneous form to useful form. Without a few degrees of data munging (irrespective of whether a specialized user or automated system performs it), the data can’t be ready for downstream consumption. Basically the procedure of cleansing the data manually is known as data munging. In R Programming the following ways are oriented with data munging process:

  • apply() Family
  • aggregate()
  • dplyr package
  • plyr package

Using apply() Family for Data Munging

In apply() collection of R the most basic function is the apply() function. Apart from that, there exists lapply(), sapply() and tapply(). The entire collection of apply() can be considered a substitute for a loop. It is the most restrictive type of function. It should be performed on a matrix which contains all homogeneous element. If the apply() function is performed using a data frame or any other kind of object, the function will first change it to a matrix and then perform its operation. It is basically used to avoid the explicit use of loop structure or construct.

Syntax:

apply(X, margin, function)

Parameters:



x: an array or matrix

margin: a value between 1 and 2 in order to decide where to apply the function [ 1- row; 2- column]

function: the function to apply

Example:

R

filter_none

edit
close

play_arrow

link
brightness_4
code

# Using apply()
m <- matrix(C <- (1:10), 
            nrow = 5,
            ncol = 6)
m
a_m <- apply(m, 2, sum)
a_m

chevron_right


Output:

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    6    1    6    1    6
[2,]    2    7    2    7    2    7
[3,]    3    8    3    8    3    8
[4,]    4    9    4    9    4    9
[5,]    5   10    5   10    5   10

[1] 15 40 15 40 15 40

In the above example, we are calculating the sum of the elements column-wise. Hence for a large set of data, we can easily produce the desired output. 

The lapply() function is used to perform operations on a list and it returns a resultant list of the same size as the input list. The ‘l’ in lapply() refers to lists. The lapply() function does not need the margin parameter.



Syntax:

lapply(X, func)

Parameters:

X: the list or a vector or an object

func: the function to apply

Example:

R

filter_none

edit
close

play_arrow

link
brightness_4
code

# Using lapply()
movies <- c("SPIDERMAN", "BATMAN",
            "AVENGERS", "FROZEN")
movies
movies_lower <- lapply(movies,
                       tolower)
str(movies_lower)

chevron_right


Output:

[1] "SPIDERMAN" "BATMAN"    "AVENGERS"  "FROZEN"   

List of 4
 $ : chr "spiderman"
 $ : chr "batman"
 $ : chr "avengers"
 $ : chr "frozen"

The sapply() function takes any vector or object or list and performs the exact operation as the lapply() function. Both of them have the same syntax.

The tapply() function is used to calculate or measure mean, median, maximum, and so on, or to perform a function on each and every factor of the variable. It is efficiently used to create a subset of any vector and then to apply or perform any function on them.



Syntax:

tapply(X, index, func = NULL)

Parameters:

X: an object or vector

index: a list of factor

func: the function to apply

Example:

R

filter_none

edit
close

play_arrow

link
brightness_4
code

# Using tapply()
data(iris)
tapply(iris$Sepal.Width, 
       iris$Species,
       median)

chevron_right


Output:

setosa versicolor  virginica 
 3.4        2.8        3.0 

Using aggregate() in Data Munging

In R, aggregate() function is used to combine or aggregate the input data frame by applying a function on each column of a sub-data frame. In order to perform aggregation or to apply the aggregate() function we must include the following:



  • The input data that we wish to aggregate
  • The variable within the data that will be used to group by
  • The function or calculation to apply

The aggregate() function will always return a data frame which contains all unique values from the input data frame after applying the specific function. We can only apply a single function inside an aggregate function. In order to include multiple functions inside the aggregate() function, we need to use the plyr package.

Syntax:

aggregate(formula, data, function)

Parameters:

formula: the variable(s) of the input data frame we want to apply functions on.

data: the data that we want to use for group by operation.

function: the function or calculation to be applied.

Example:

R



filter_none

edit
close

play_arrow

link
brightness_4
code

# R prograom to illustrate
# aggregate() function
assets <- data.frame(
  asset.class = c("equity", "equity",
                  "equity", "option",
                  "option", "option",
                  "bond", "bond"),
       rating = c("AAA", "A", "A",
                  "AAA", "BB", "BB",
                  "AAA", "A"),
counterparty.a = c(runif(3), rnorm(5)),
counterparty.b = c(runif(3), rnorm(5)),
counterparty.c = c(runif(3), rnorm(5)))
assets
exposures <- aggregate(
  x = assets[c("counterparty.a"
               "counterparty.b",
               "counterparty.c")],
 by = assets[c("asset.class", "rating")],
 FUN = function(market.values){
 sum(pmax(market.values, 0))
                       })
exposures

chevron_right


Output:

  asset.class rating counterparty.a counterparty.b counterparty.c
1      equity    AAA     0.08250275      0.5474595      0.9966172
2      equity      A     0.33931258      0.6442402      0.2348197
3      equity      A     0.68078755      0.5962635      0.6126720
4      option    AAA    -0.47624689     -0.4622881     -1.2362731
5      option     BB    -0.78860284      0.3219559     -1.2847157
6      option     BB    -0.59461727     -0.2840014     -0.5739735
7        bond    AAA     1.65090747      1.0918564      0.6179858
8        bond      A    -0.05402813      0.1602164      1.1098481

  asset.class rating counterparty.a counterparty.b counterparty.c
1        bond      A     0.00000000      0.1602164      1.1098481
2      equity      A     1.02010013      1.2405038      0.8474916
3        bond    AAA     1.65090747      1.0918564      0.6179858
4      equity    AAA     0.08250275      0.5474595      0.9966172
5      option    AAA     0.00000000      0.0000000      0.0000000
6      option     BB     0.00000000      0.3219559      0.0000000

We can see that in the above example the values of assets data frame have been aggregated on “asset.class” and “rating” columns.

Using the plyr Package for Data Munging

The plyr package is used for splitting, applying, and combining data. The plyr is a set of tools that can be used for splitting up huge or big data for creating a homogeneous piece, then applying a function on each and every piece and finally combine all the resultant values. We can already perform these actions in R, but on using plyr we can do it easily since:

  • The names, arguments, and outputs are totally consistent
  • Convenient parallelism
  • Both input and output involves data frames, matrices or lists
  • To track the long execution or running programs it provides a progress bar
  • Built-in informative error messages and error recovery
  • Labels which are maintained through all transformations.

The two functions that we are going to discuss in this section are ddply() and llply(). For each subset of a given data frame, the ddply() applies a function and then combine the result.

Syntax:

ddply(.data, .variables, .fun = NULL, …, .progress = “none”, .inform = FALSE,

         .drop = TRUE, .parallel = FALSE, .paropts = NULL)

Parameters:

data: the data frame that is to be processed

variable: the variable based on which it will split the data frame



fun: the function to be applied

…: other arguments that are passed to fun

progress: name of the progess bar

inform: whether to produce any informative error message

drop: combination of variables that is not in the input data frame should be preserved or dropped.  

parallel: whether to apply function parallel

paropts: list of extra or additional options passed

Example:

R

filter_none

edit
close

play_arrow

link
brightness_4
code

# Using ddply()
library(plyr)
dfx <- data.frame(
  group = c(rep('A', 8), 
            rep('B', 15), 
            rep('C', 6)),
  sex = sample(c("M", "F"),
               size = 29,
               replace = TRUE),
  age = runif(n = 29,
              min = 18,
              max = 54)
)
  
ddply(dfx, .(group, sex), summarize,
      mean = round(mean(age), 2),
      sd = round(sd(age), 2))

chevron_right


Output:



  group sex  mean    sd
1     A   F 41.00  9.19
2     A   M 35.76 12.14
3     B   F 34.75 11.70
4     B   M 40.01 10.10
5     C   F 25.13 10.37
6     C   M 43.26  7.63

Now we will see how to use llply() to work on data munging. The llply() function is used on each element of lists, where we apply a function on them, and the combined resultant output is also a list.

Syntax:

llply(.data, .fun = NULL,
…, .progress = “none”, .inform = FALSE,
.parallel = FALSE, .paropts = NULL)

Example:

R

filter_none

edit
close

play_arrow

link
brightness_4
code

# Using llply()
library(plyr)
x <- list(a = 1:10, beta = exp(-3:3), 
          logic = c(TRUE, FALSE,
                    FALSE, TRUE))
llply(x, mean)
llply(x, quantile, probs = 1:3 / 4)

chevron_right


Output:

$a
[1] 5.5

$beta
[1] 4.535125

$logic
[1] 0.5

$a
 25%  50%  75% 
3.25 5.50 7.75 

$beta
      25%       50%       75% 
0.2516074 1.0000000 5.0536690 

$logic
25% 50% 75% 
0.0 0.5 1.0 

Using dplyr package for Data Munging

The dplyr package can be considered as a grammar of data manipulation which is providing us a consistent set of verbs that helps us to solve some most common challenges of data manipulation:

  • arrange()  is used to change the order of the rows.
  • filter() is used to pick cases depending on their value or based on the value.
  • mutate() is used to add new variables that are functions of already existing variables.
  • select() is used to pick or select variables based on their names.
  • summarize() is used to reduce multiple values to a single summary.

There are many more functions under dplyr. The dplyr uses a very efficient backend which leads to less waiting time for the computation. It is more efficient than the plyr package. 

 Syntax:

arrange(.data, …, .by_group = FALSE)



filter(.data, …)

mutate(.data, …)

select(.data, …)

summarize(X, by, fun, …, stat.name = deparse(substitue(X)),  

                  type = c(“variable”,”matrix”), subset = TRUE, keepcolnames = FALSE)

Example:

R

filter_none

edit
close

play_arrow

link
brightness_4
code

# Using dplyr package
  
# Import the library
library(dplyr)
  
# Using arrange()
starwars %>% 
    arrange(desc(mass))
  
# Using filter()
starwars %>% 
    filter(species == "Droid")
  
# Using mutate() 
starwars %>%
    mutate(name, 
    bmi = mass / ((height / 100)  ^ 2)) %>% 
    select(name:mass, bmi)
  
# Using select()
starwars %>% 
    select(name, ends_with("color"))
  
# Using summarise()
starwars %>% group_by(species) %>% 
  summarise(n = n(),
  mass = mean(mass, na.rm = TRUE)) %>%
  filter(n > 1)

chevron_right


Output:

> starwars %>% arrange(desc(mass))
# A tibble: 87 x 13
   name     height  mass hair_color  skin_color   eye_color  birth_year gender  homeworld species films vehicles starships
   <chr>     <int> <dbl> <chr>       <chr>        <chr>           <dbl> <chr>   <chr>     <chr>   <lis> <list>   <list>   
 1 Jabba D~    175  1358 NA          green-tan, ~ orange          600   hermap~ Nal Hutta Hutt    <chr~ <chr [0~ <chr [0]>
 2 Grievous    216   159 none        brown, white green, ye~       NA   male    Kalee     Kaleesh <chr~ <chr [1~ <chr [1]>
 3 IG-88       200   140 none        metal        red              15   none    NA        Droid   <chr~ <chr [0~ <chr [0]>
 4 Darth V~    202   136 none        white        yellow           41.9 male    Tatooine  Human   <chr~ <chr [0~ <chr [1]>
 5 Tarfful     234   136 brown       brown        blue             NA   male    Kashyyyk  Wookiee <chr~ <chr [0~ <chr [0]>
 6 Owen La~    178   120 brown, grey light        blue             52   male    Tatooine  Human   <chr~ <chr [0~ <chr [0]>
 7 Bossk       190   113 none        green        red              53   male    Trandosha Trando~ <chr~ <chr [0~ <chr [0]>
 8 Chewbac~    228   112 brown       unknown      blue            200   male    Kashyyyk  Wookiee <chr~ <chr [1~ <chr [2]>
 9 Jek Ton~    180   110 brown       fair         blue             NA   male    Bestine ~ Human   <chr~ <chr [0~ <chr [1]>
10 Dexter ~    198   102 none        brown        yellow           NA   male    Ojom      Besali~ <chr~ <chr [0~ <chr [0]>
# ... with 77 more rows

> starwars %>% filter(species == "Droid")
# A tibble: 5 x 13
  name  height  mass hair_color skin_color  eye_color birth_year gender homeworld species films     vehicles  starships
  <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr>  <chr>     <chr>   <list>    <list>    <list>   
1 C-3PO    167    75 NA         gold        yellow           112 NA     Tatooine  Droid   <chr [6]> <chr [0]> <chr [0]>
2 R2-D2     96    32 NA         white, blue red               33 NA     Naboo     Droid   <chr [7]> <chr [0]> <chr [0]>
3 R5-D4     97    32 NA         white, red  red               NA NA     Tatooine  Droid   <chr [1]> <chr [0]> <chr [0]>
4 IG-88    200   140 none       metal       red               15 none   NA        Droid   <chr [1]> <chr [0]> <chr [0]>
5 BB8       NA    NA none       none        black             NA none   NA        Droid   <chr [1]> <chr [0]> <chr [0]>

> starwars %>% mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>% select(name:mass, bmi)
# A tibble: 87 x 4
   name               height  mass   bmi
   <chr>               <int> <dbl> <dbl>
 1 Luke Skywalker        172    77  26.0
 2 C-3PO                 167    75  26.9
 3 R2-D2                  96    32  34.7
 4 Darth Vader           202   136  33.3
 5 Leia Organa           150    49  21.8
 6 Owen Lars             178   120  37.9
 7 Beru Whitesun lars    165    75  27.5
 8 R5-D4                  97    32  34.0
 9 Biggs Darklighter     183    84  25.1
10 Obi-Wan Kenobi        182    77  23.2
# ... with 77 more rows

> starwars %>% select(name, ends_with("color"))
# A tibble: 87 x 4
   name               hair_color    skin_color  eye_color
   <chr>              <chr>         <chr>       <chr>    
 1 Luke Skywalker     blond         fair        blue     
 2 C-3PO              NA            gold        yellow   
 3 R2-D2              NA            white, blue red      
 4 Darth Vader        none          white       yellow   
 5 Leia Organa        brown         light       brown    
 6 Owen Lars          brown, grey   light       blue     
 7 Beru Whitesun lars brown         light       blue     
 8 R5-D4              NA            white, red  red      
 9 Biggs Darklighter  black         light       brown    
10 Obi-Wan Kenobi     auburn, white fair        blue-gray
# ... with 77 more rows

> starwars %>% group_by(species) %>% 
+   summarise(n = n(),mass = mean(mass, na.rm = TRUE)) %>%
+   filter(n > 1)
# A tibble: 9 x 3
  species      n  mass
  <chr>    <int> <dbl>
1 Droid        5  69.8
2 Gungan       3  74  
3 Human       35  82.8
4 Kaminoan     2  88  
5 Mirialan     2  53.1
6 Twi'lek      2  55  
7 Wookiee      2 124  
8 Zabrak       2  80  
9 NA           5  48  



My Personal Notes arrow_drop_up

Check out this Author's contributed articles.

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.


Article Tags :

1


Please write to us at contribute@geeksforgeeks.org to report any issue with the above content.