Count the frequency of a variable per column in R Dataframe

Last Updated : 30 May, 2021

A data frame may contain repeated or missing values, and any column can hold several instances of the same value. Many statistical and analytical tasks depend on counting how many times a particular value occurs in each column. In this article, we are going to see how to find the frequency of a value per column of a data frame in the R programming language.

Method 1: Using plyr package

The plyr package is used to split data into pieces, apply a built-in or user-defined function to each piece, and combine the results. It can be installed and then loaded into the workspace using the following commands:

install.packages("plyr")
library(plyr)

The ldply() function of this package applies a function to each element of a list and then combines the results into a data frame. Because a data frame is a list of columns, ldply() can be used to count the occurrences of a value in columns of integer, character, or factor type.

Syntax: ldply(.data, .fun = NULL, ...)

Arguments :

.data – The list (here, the data frame) to be processed

.fun – The function to be applied to each element

In this method, an anonymous function built around sum() is applied to each column of the data frame. Comparing a column with the specified value produces a logical vector, and sum() counts its TRUE entries, which gives the number of times the value occurs in that column. The output is a data frame whose first column (.id) holds the column names of the original data frame and whose second column (V1) holds the number of occurrences of the specified value in that column.
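The counting step described above can be tried on a single vector first. The following is a minimal sketch; the vector x and its values are hypothetical sample data, and the NA is included only to show how a missing value affects the result.

```r
# Hypothetical sample column; NA stands in for a missing value
x <- c("a", "b", "a", NA, "c", "a")

# Comparing against a value yields a logical vector (the NA stays NA)
matches <- x == "a"
print(matches)

# sum() treats TRUE as 1 and FALSE as 0; na.rm = TRUE skips the NA
count_a <- sum(matches, na.rm = TRUE)
print(count_a)
```

Without na.rm = TRUE, a single missing value in the column makes the whole sum NA, so the argument is worth adding whenever the data may be incomplete.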

Code:

R

library ('plyr') 
set.seed(1)   
# creating a data frame 
data_table <- data.frame(col1 =  sample(letters[1:3], 8,  
                                        replace = TRUE) , 
                         col2 = sample(letters[1:3], 8, 
                                       replace = TRUE), 
                         col3 = sample(letters[1:3], 8, 
                                       replace = TRUE), 
                         col4 = sample(letters[1:3], 8,  
                                       replace = TRUE)) 
  
print ("Original DataFrame") 
print (data_table) 
print ("Count of value per column") 
  
# count the number of "a" values in each column 
ldply(data_table, function(x) sum(x == "a")) 

Output:

[1] "Original DataFrame" 
  col1 col2 col3 col4 
1    a    b    b    a 
2    c    c    b    b 
3    a    c    c    a 
4    b    a    a    a 
5    a    a    c    b 
6    c    a    a    b 
7    c    b    a    b 
8    b    b    a    a    
[1] "Count of value per column"
   .id V1 
1 col1  3 
2 col2  3 
3 col3  4 
4 col4  4

The method can also be used to count the occurrences of any value from a vector of values. The function tests each element of a column for membership in the vector using the %in% operator, which returns TRUE or FALSE for every element. The sum of the TRUE results within each column is then returned as the count.

val %in% vec
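As a small illustration of the membership test, the sketch below applies %in% to one vector. The names x and vec and their contents are hypothetical sample data.

```r
# Hypothetical values to look for and a sample column to search
vec <- c("a", "b")
x <- c("a", "c", "b", NA, "a")

# %in% returns TRUE or FALSE for every element, never NA
hits <- x %in% vec
print(hits)

# Number of elements of x that belong to vec
total <- sum(hits)
print(total)
```

Unlike ==, the %in% operator reports FALSE for a missing value, so no na.rm argument is needed here.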

Code:

R

library ('plyr') 
  
set.seed(1)   
  
# creating a data frame 
data_table <- data.frame(col1 =  sample(letters[1:3], 8,  
                                        replace = TRUE) , 
                         col2 = sample(letters[1:3], 8, 
                                       replace = TRUE), 
                         col3 = sample(letters[1:3], 8, 
                                       replace = TRUE), 
                         col4 = sample(letters[1:3], 8,  
                                       replace = TRUE)) 
  
print ("Original DataFrame") 
print (data_table) 
print ("Count of value per column") 
# values to look for; c("a", "b") matches the counts in the output below 
vec <- c("a", "b") 
ldply(data_table, function(x) sum(x %in% vec)) 

Output:

[1] "Original DataFrame"
  col1 col2 col3 col4
1    a    b    b    a
2    c    c    b    b
3    a    c    c    a
4    b    a    a    a
5    a    a    c    b
6    c    a    a    b
7    c    b    a    b
8    b    b    a    a    
[1] "Count of value per column"
   .id V1
1 col1  5
2 col2  6
3 col3  6
4 col4  8
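These counts can be cross-checked without plyr. The sketch below swaps in base R's vapply() in place of ldply(); it is an alternative for verification, not the plyr approach itself. The data frame is typed in by hand from the output printed above, and the lookup vector c("a", "b") is assumed to be the one that produces these counts.

```r
# Data frame copied by hand from the printed output above
data_table <- data.frame(
  col1 = c("a", "c", "a", "b", "a", "c", "c", "b"),
  col2 = c("b", "c", "c", "a", "a", "a", "b", "b"),
  col3 = c("b", "b", "c", "a", "c", "a", "a", "a"),
  col4 = c("a", "b", "a", "a", "b", "b", "b", "a"),
  stringsAsFactors = FALSE
)

# Assumed lookup vector
vec <- c("a", "b")

# vapply() applies the function to each column and checks that every
# result is a single integer; the result is a named integer vector
counts <- vapply(data_table, function(x) sum(x %in% vec), integer(1))
print(counts)
```

The result is a named vector rather than a data frame, which is the main practical difference from the ldply() output.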

Method 2: Using sapply() method

The sapply() function can also be used to compute the frequency of each value within each column of the data frame. It applies a function to each element of a vector or list and returns the results, simplified to a vector or matrix where possible.

sapply(df, FUN)

In this case, FUN is a user-defined function. Before it is applied, the set of distinct values across the entire data frame is computed. This is done with unlist(), which flattens the data frame into a single vector, followed by unique(), which keeps only the distinct values contained in that vector.

unique(vec)
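The two steps can be seen on a very small data frame. This is a minimal sketch; df and its contents are hypothetical sample data.

```r
# Hypothetical two-column data frame
df <- data.frame(col1 = c("x", "y"),
                 col2 = c("y", "z"),
                 stringsAsFactors = FALSE)

# unlist() flattens the columns into one named character vector
flat <- unlist(df)
print(flat)

# unique() keeps each distinct value once, in order of first appearance
lvls <- unique(flat)
print(lvls)
```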

Inside the function, each column is explicitly converted to a factor by factor(), with the levels argument set to the unique values found above. Every column therefore shares the same set of levels, including levels that never occur in that particular column.

factor(vec, levels = lvls)
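The effect of supplying the levels explicitly can be sketched as follows. The vectors x and lvls are hypothetical; lvls plays the role of the unique values collected from the whole data frame.

```r
# Hypothetical column that contains no "c" values
x <- c("a", "b", "b", "a")

# Hypothetical set of values found across the whole data frame
lvls <- c("a", "c", "b")

# Without levels, only the values present in x become levels
f_default <- factor(x)
print(levels(f_default))

# With levels supplied, "c" is kept as a level even though it is absent
f_shared <- factor(x, levels = lvls)
print(levels(f_shared))
```

Keeping the absent level is what later allows a count of zero to be reported for it.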

Finally, table() is applied. The table() function uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels; with a single factor, it returns the count of each level. A contingency table is a tabulation of the counts (and, after further processing, percentages) for one or more variables. By default, missing values in the factor are excluded from the counts. The output is an object of class table. This function is widely used for cross-tabulation and statistical analysis.

table(fac_vec, ...)
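A short sketch of table() on a single factor is given below, including the default treatment of missing values. The vector x is hypothetical sample data; useNA is a documented argument of table().

```r
# Hypothetical column with one missing value and no "c" values
x <- c("a", "b", NA, "a")
f <- factor(x, levels = c("a", "c", "b"))

# Default behaviour: NA is ignored, absent levels are counted as 0
counts <- table(f)
print(counts)

# useNA = "ifany" adds a separate cell for missing values
counts_na <- table(f, useNA = "ifany")
print(counts_na)
```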

The output is a matrix whose row names are the unique values of the data frame and whose column names are the column names of the original data frame. Each cell gives the number of occurrences of that row's value in the respective column.

Code:

R

set.seed(1)   
  
# creating a data frame 
data_table <- data.frame(col1 =  sample(letters[1:3], 8, 
                                        replace = TRUE) , 
                         col2 = sample(letters[1:3], 8, 
                                       replace = TRUE), 
                         col3 = sample(letters[1:3], 8, 
                                       replace = TRUE), 
                         col4 = sample(letters[1:3], 8, 
                                       replace = TRUE)) 
  
print ("Original DataFrame") 
print (data_table) 
  
# compute unique levels in data frame 
lvls <- unique(unlist(data_table)) 
  
# tabulate the count of each level per column 
freq <- sapply(data_table,  
               function(x) table(factor(x, levels = lvls,  
                                        ordered = TRUE))) 
print ("Count of variables per column") 
print (freq) 

Output:

[1] "Original DataFrame"
  col1 col2 col3 col4
1    a    b    b    a
2    c    c    b    b
3    a    c    c    a
4    b    a    a    a
5    a    a    c    b
6    c    a    a    b
7    c    b    a    b
8    b    b    a    a 
[1] "Count of variables per column" 
  col1 col2 col3 col4 
a    3    3    4    4 
c    3    2    2    0 
b    2    3    2    4
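Since sapply() returns a matrix here, individual counts can be looked up by row and column name. The following is a minimal sketch on a smaller, hypothetical data frame; the conversion with as.data.frame() is optional and shown only for cases where a data frame is preferred.

```r
# Hypothetical data frame, smaller than the one used above
df <- data.frame(col1 = c("a", "b", "a"),
                 col2 = c("b", "b", "c"),
                 stringsAsFactors = FALSE)

# Same approach as above: shared levels, then one table per column
lvls <- unique(unlist(df))
freq <- sapply(df, function(x) table(factor(x, levels = lvls)))
print(freq)

# Look up a single count by value (row) and column name
b_in_col2 <- freq["b", "col2"]
print(b_in_col2)

# Optional: convert the matrix to a data frame
freq_df <- as.data.frame(freq)
print(freq_df)
```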
