Pandas dataframe.groupby() Method

Last Updated : 04 Sep, 2023

Pandas groupby() splits a DataFrame into groups according to the values of one or more keys, applies a function to each group, and combines the results. This makes it the main tool for aggregating data by category, and its many variations cover most splitting criteria.
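As a minimal sketch of this split, apply, and combine idea, the snippet below builds a tiny DataFrame in memory and averages a column per group. The rows are a small subset of values that appear in this article's sample output, not the full dataset, and the variable name mean_salary is only illustrative.

```python
import pandas as pd

# A few rows with values taken from the sample output in this article,
# built in memory so the sketch runs without nba.csv.
df = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "R.J. Hunter",
             "Kent Bazemore", "Al Horford"],
    "Team": ["Boston Celtics", "Boston Celtics", "Boston Celtics",
             "Atlanta Hawks", "Atlanta Hawks"],
    "Salary": [7730337.0, 6796117.0, 1148640.0, 2000000.0, 12000000.0],
})

# Split the rows by "Team", apply mean() to each group's "Salary",
# and combine the per-group results into one Series indexed by team.
mean_salary = df.groupby("Team")["Salary"].mean()
print(mean_salary)
```

The result has one entry per team, with the team names as the index.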

Pandas dataframe.groupby()

Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.
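The "mapping of labels to group names" can be made concrete with a small sketch. Here the frame is indexed by position code, and the dict role (a hypothetical mapping invented for this illustration) assigns each index label to a group name. A function works the same way: it is called on each index label and its return value becomes the group key.

```python
import pandas as pd

# Illustrative frame indexed by position code; salaries reuse values
# from this article's sample output.
df = pd.DataFrame(
    {"Salary": [7730337.0, 6796117.0, 1148640.0, 5000000.0]},
    index=["PG", "SF", "SG", "PF"],
)

# Hypothetical mapping of index labels to group names.
role = {"PG": "guard", "SG": "guard", "SF": "forward", "PF": "forward"}

# Group by the mapping: each index label is looked up in the dict.
by_mapping = df.groupby(role)["Salary"].sum()

# Group by a function: it receives each index label and returns the key
# (here, the last letter of the position code).
by_function = df.groupby(lambda label: label[-1])["Salary"].sum()

print(by_mapping)
print(by_function)
```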

Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

Parameters :

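by : mapping, function, str, or iterable. Determines the groups: a column name (or list of column names), a function called on each index label, or a dict or Series that maps index labels to group names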

axis : int, default 0. Split along rows (0) or columns (1)

level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : When calling apply, add group keys to index to identify pieces

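squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent type. This parameter was deprecated in pandas 1.1 and removed in pandas 2.0, so omit it on current versions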

Returns : GroupBy object
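The effect of as_index and sort is easiest to see side by side. The following is a small sketch on an in-memory frame (values reused from this article's sample output; the variable names are illustrative), not a complete tour of every parameter.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Boston Celtics", "Atlanta Hawks",
             "Boston Celtics", "Atlanta Hawks"],
    "Salary": [7730337.0, 2000000.0, 6796117.0, 12000000.0],
})

# Default: group labels become the index and are sorted.
default = df.groupby("Team")["Salary"].sum()

# as_index=False: "Team" stays an ordinary column ("SQL-style" output).
flat = df.groupby("Team", as_index=False)["Salary"].sum()

# sort=False: groups appear in order of first appearance in the data.
unsorted = df.groupby("Team", sort=False)["Salary"].sum()

print(default)
print(flat)
print(unsorted)
```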

Dataset used: the examples below read a CSV file of NBA player data named nba.csv.

Example 1: Use groupby() function to group the data based on the “Team”.

Python3

# importing pandas as pd
import pandas as pd
 
# Creating the dataframe
df = pd.read_csv("nba.csv")
 
# Print the first five rows of the dataframe
print(df.head())

Output:

            Name            Team  Number Position   Age Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0

Now apply the groupby() function.

Python3

# applying groupby() function to
# group the data on team value.
gk = df.groupby('Team')
 
# Show the first non-null value of each
# column within every group.
gk.first()

Output :

                                         Name  Number Position   Age Height   Weight                College      Salary
Team                                                                          
Atlanta Hawks                   Kent Bazemore    24.0       SF  26.0    6-5    201.0           Old Dominion   2000000.0 
Boston Celtics                  Avery Bradley     0.0       PG  25.0    6-2    180.0                  Texas   7730337.0
Brooklyn Nets                Bojan Bogdanovic    44.0       SG  27.0    6-8    216.0         Oklahoma State   3425510.0
Charlotte Hornets               Nicolas Batum     5.0       SG  27.0    6-8    200.0  Virginia Commonwealth  13125306.0
Chicago Bulls                Cameron Bairstow    41.0       PF  25.0    6-9    250.0             New Mexico    845059.0
Cleveland Cavaliers       Matthew Dellavedova     8.0       PG  25.0    6-4    198.0           Saint Mary's   1147276.0
Dallas Mavericks              Justin Anderson     1.0       SG  22.0    6-6    228.0               Virginia   1449000.0
Denver Nuggets                 Darrell Arthur     0.0       PF  28.0    6-9    235.0                 Kansas   2814000.0
Detroit Pistons                  Joel Anthony    50.0        C  33.0    6-9    245.0                   UNLV   2500000.0
Golden State Warriors         Leandro Barbosa    19.0       SG  33.0    6-3    194.0         North Carolina   2500000.0
Houston Rockets                  Trevor Ariza     1.0       SF  30.0    6-8    215.0                   UCLA   8193030.0
Indiana Pacers                    Lavoy Allen     5.0       PF  27.0    6-9    255.0                 Temple   4050000.0
Los Angeles Clippers             Cole Aldrich    45.0        C  27.0   6-11    250.0                 Kansas   1100602.0
Los Angeles Lakers               Brandon Bass     2.0       PF  31.0    6-8    250.0                    LSU   3000000.0
Memphis Grizzlies                Jordan Adams     3.0       SG  21.0    6-5    209.0                   UCLA   1404600.0
Miami Heat                         Chris Bosh     1.0       PF  32.0   6-11    235.0           Georgia Tech  22192730.0
Milwaukee Bucks         Giannis Antetokounmpo    34.0       SF  21.0   6-11    222.0                Arizona   1953960.0
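Note that first() returns the first non-null value of each column, so a row of its output can combine values from different source rows. The sketch below contrasts it with head(1), which returns the first actual row of each group. It uses two rows from the sample data above; the variable names are illustrative.

```python
import pandas as pd

# Two Boston Celtics rows from the sample data; the first has no salary.
df = pd.DataFrame({
    "Team": ["Boston Celtics", "Boston Celtics"],
    "Name": ["John Holland", "R.J. Hunter"],
    "Salary": [float("nan"), 1148640.0],
})

# first(): first non-null value per column, so Name comes from row 0
# while Salary comes from row 1.
first_non_null = df.groupby("Team").first()

# head(1): the first row of each group exactly as stored, NaN included.
first_rows = df.groupby("Team").head(1)

print(first_non_null)
print(first_rows)
```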

To inspect the rows that belong to a single group, pass the group's key (here, a team name) to get_group().

Python3

# Finding the values contained in the "Boston Celtics" group
gk.get_group('Boston Celtics')

Output :

               Name            Team  Number Position   Age Height  Weight            College      Salary
0     Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas   7730337.0
1       Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette   6796117.0
2      John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University         NaN
3       R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State   1148640.0  
4     Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN   5000000.0  
5      Amir Johnson  Boston Celtics    90.0       PF  29.0    6-9   240.0                NaN  12000000.0 
6     Jordan Mickey  Boston Celtics    55.0       PF  21.0    6-8   235.0                LSU   1170960.0 
7      Kelly Olynyk  Boston Celtics    41.0        C  25.0    7-0   238.0            Gonzaga   2165160.0  
8      Terry Rozier  Boston Celtics    12.0       PG  22.0    6-2   190.0         Louisville   1824360.0 
9      Marcus Smart  Boston Celtics    36.0       PG  22.0    6-4   220.0     Oklahoma State   3431040.0  
10  Jared Sullinger  Boston Celtics     7.0        C  24.0    6-9   260.0         Ohio State   2569260.0
11    Isaiah Thomas  Boston Celtics     4.0       PG  27.0    5-9   185.0         Washington   6912869.0 
12      Evan Turner  Boston Celtics    11.0       SG  27.0    6-7   220.0         Ohio State   3425510.0
13      James Young  Boston Celtics    13.0       SG  20.0    6-6   215.0           Kentucky   1749840.0 
14     Tyler Zeller  Boston Celtics    44.0        C  26.0    7-0   253.0     North Carolina   2616975.0
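Besides get_group(), a GroupBy object can be iterated to visit every group in turn. The following is a minimal sketch on a three-row in-memory frame (names taken from this article's sample output; sizes is an illustrative variable), showing both approaches.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "Kent Bazemore"],
    "Team": ["Boston Celtics", "Boston Celtics", "Atlanta Hawks"],
})

gk = df.groupby("Team")

# Fetch one group by its key.
celtics = gk.get_group("Boston Celtics")
print(celtics)

# Iterating yields (key, sub-DataFrame) pairs, one per group.
sizes = {team: len(group) for team, group in gk}
print(sizes)
```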

Example 2: Use groupby() to form groups based on more than one category, i.e. pass a list of column names to split on several columns at once.

Python3

# importing pandas as pd
import pandas as pd
 
# Creating the dataframe
df = pd.read_csv("nba.csv")
 
# First grouping based on "Team"
# Within each team we are grouping based on "Position"
gkk = df.groupby(['Team', 'Position'])
 
# Show the first non-null value of each column in each group
gkk.first()

Output :

                                         Name  Number   Age Height  Weight              College      Salary
Team               Position                                                  
Atlanta Hawks      C               Al Horford    15.0  30.0   6-10   245.0              Florida  12000000.0 
                   PF          Kris Humphries    43.0  31.0    6-9   235.0            Minnesota   1000000.0  
                   PG         Dennis Schroder    17.0  22.0    6-1   172.0          Wake Forest   1763400.0  
                   SF           Kent Bazemore    24.0  26.0    6-5   201.0         Old Dominion   2000000.0  
                   SG        Tim Hardaway Jr.    10.0  24.0    6-6   205.0             Michigan   1304520.0 
...                                       ...     ...   ...    ...     ...                  ...         ...
Washington Wizards C            Marcin Gortat    13.0  32.0   6-11   240.0 North Carolina State  11217391.0  
                   PF             Drew Gooden    90.0  34.0   6-10   250.0               Kansas   3300000.0  
                   PG          Ramon Sessions     7.0  30.0    6-3   190.0               Nevada   2170465.0  
                   SF            Jared Dudley     1.0  30.0    6-7   225.0       Boston College   4375000.0  
                   SG           Alan Anderson     6.0  33.0    6-6   220.0       Michigan State   4000000.0
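When grouping by several columns, each group key is a tuple and the result is indexed by a MultiIndex with one level per key. The sketch below illustrates this on a small in-memory frame (rows drawn from this article's sample output; counts and celtics_sg are illustrative names).

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Atlanta Hawks", "Atlanta Hawks",
             "Boston Celtics", "Boston Celtics", "Boston Celtics"],
    "Position": ["C", "SF", "PG", "SG", "SG"],
    "Name": ["Al Horford", "Kent Bazemore",
             "Avery Bradley", "John Holland", "R.J. Hunter"],
})

gkk = df.groupby(["Team", "Position"])

# Number of rows per (Team, Position) pair, indexed by a MultiIndex.
counts = gkk.size()
print(counts)

# With several keys, get_group() takes a tuple with one value per key.
celtics_sg = gkk.get_group(("Boston Celtics", "SG"))
print(celtics_sg)
```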
