Grouping Categorical Variables in Pandas Dataframe

  • Last Updated : 17 Aug, 2020

First, we need to understand what categorical variables are in pandas. Categorical is a data type available in the pandas library for Python. A categorical variable takes only a limited, usually fixed, number of possible values (its categories). Some examples of categorical variables are gender, blood group, and language. The main contrast with numerical variables is that arithmetic operations cannot be performed on them.
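As a minimal sketch of what this means in practice (the blood group values below are made up for illustration and are not part of the original example), a categorical Series stores its distinct categories once, maps each value to an integer code, and rejects arithmetic reductions such as sum():

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
blood = pd.Series(['A', 'B', 'O', 'A', 'AB'], dtype='category')

# The distinct categories are stored once, in sorted order
categories = list(blood.cat.categories)

# Each value is stored as an integer code pointing into the categories
codes = list(blood.cat.codes)

# Arithmetic reductions are not defined for categorical data
try:
    blood.sum()
    sum_supported = True
except TypeError:
    sum_supported = False

print(categories)
print(codes)
print(sum_supported)
```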

A dataframe of categorical values can be created in pandas by passing dtype="category" to the DataFrame constructor:

Python3




# importing pandas as pd 
import pandas as pd 
  
# Create the dataframe 
# with categorical variable 
df = pd.DataFrame({'A': ['a', 'b', 'c',
                         'c', 'a', 'b'],
                   'B': [0, 1, 1, 0, 1, 0]},
                  dtype = "category")
# show the data types
df.dtypes

Output:
 

datatypes of dataframe



One important point here is that the categories are not shared between columns: the conversion is done column by column, so each column gets its own set of categories.
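A short sketch of how this could be checked, rebuilding the same dataframe and reading each column's categories through the .cat accessor (the variable names cats_a and cats_b are only illustrative):

```python
import pandas as pd

# Same dataframe as in the example above
df = pd.DataFrame({'A': ['a', 'b', 'c',
                         'c', 'a', 'b'],
                   'B': [0, 1, 1, 0, 1, 0]},
                  dtype="category")

# Each column carries its own, independent set of categories
cats_a = list(df['A'].cat.categories)
cats_b = list(df['B'].cat.categories)

print(cats_a)
print(cats_b)
```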

In some tasks we need to group our categorical data. This is done with the groupby() method provided by pandas. For categorical grouping columns, the result can cover all combinations of their categories, including ones that never occur in the data. Along with groupby() we have to apply an aggregate function, which determines how the values in each group are summarised. Some aggregate functions are mean(), sum(), count(), etc.
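To illustrate the point about combinations, here is a small sketch on made-up data in which the category 'c' is declared but never occurs. It uses the observed parameter of groupby(), which the original example does not mention: observed=False keeps unobserved categories in the result, while observed=True drops them. The default for this parameter depends on the pandas version, so it is passed explicitly here.

```python
import pandas as pd

# Hypothetical data: category 'c' is declared but never appears
demo = pd.DataFrame({
    'A': pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']),
    'B': [1, 2, 3],
})

# Keep every declared category, even if it has no rows
all_groups = demo.groupby('A', observed=False)['B'].count()

# Keep only the categories that actually occur
seen_groups = demo.groupby('A', observed=True)['B'].count()

print(all_groups)
print(seen_groups)
```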

Now let us apply groupby() together with the count() function.

Python3




# initial state
print(df)
  
# counting number of each category
print(df.groupby(['A']).count().reset_index())

Output:

dataframe

Group by column ‘A’



Now, one more example, this time with the mean() function. Here only column A is converted to categorical while the other columns stay numerical, and the mean is calculated for each combination of the values in columns A and B.

Python3




# importing pandas as pd 
import pandas as pd 
  
# Create the dataframe 
df = pd.DataFrame({'A': ['a', 'b', 'c',
                         'c', 'a', 'b'], 
                   'B': [0, 1, 1,
                         0, 1, 0], 
                   'C':[7, 8, 9,
                        5, 3, 6]})
  
# change the datatype of 
# column 'A' into category
# data type
df['A'] = df['A'].astype('category')
  
# initial state
print(df)
  
# calculating mean with 
# all combinations of A and B
print(df.groupby(['A','B']).mean().reset_index())

  

Output:

Dataframe

Group by both column ‘A’ and ‘B’

Other aggregate functions are also implemented in the same way using groupby().
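For instance, several aggregates can be computed in one call with agg(). The sketch below reuses columns A and C from the example above; the choice of sum, min and max, and the explicit observed=True argument, are assumptions made for illustration only.

```python
import pandas as pd

# Columns A and C from the example above
df = pd.DataFrame({'A': ['a', 'b', 'c',
                         'c', 'a', 'b'],
                   'C': [7, 8, 9,
                         5, 3, 6]})
df['A'] = df['A'].astype('category')

# Several aggregate functions applied to column C per category of A
summary = df.groupby('A', observed=True)['C'].agg(['sum', 'min', 'max'])

print(summary.reset_index())
```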
