Use Pandas to Calculate Statistics in Python
  • Last Updated : 10 Jul, 2020

pandas reduces many common statistical operations in Python to single-line commands. This post covers some of the most useful ones, demonstrated on the Titanic survival dataset.

The method signatures shown below follow the pandas 1.x documentation. From pandas 2.0 onwards, the level parameter is no longer accepted by the reduction methods (mean, median, std, max, min, count) and numeric_only defaults to False.

Python3

# Import Pandas Library
import pandas as pd
  
# Load Titanic Dataset as Dataframe
dataset = pd.read_csv('train.csv')
  
# Show the dataset
# head() shows the first
# 5 rows of the DataFrame by default
dataset.head()


Output:

Titanic dataframe

1. Mean:

Calculates the mean or average value by using DataFrame/Series.mean() method.



Syntax: DataFrame/Series.mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Parameters:

  • axis: {index (0), columns (1)}

          Specify the axis for the function to be applied on.

  • skipna:  This parameter takes bool value, default value is True

           It excludes null values when computing the result.

  • level: This parameter takes int value or level name, default value is None.

          If the axis is a MultiIndex, count along a particular level, collapsing into a Series.

  • numeric_only: This parameter takes bool value, default value is None

           Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric  data values. Not implemented for Series.

  • **kwargs: Additional arguments to be passed to the function.

Returns: The mean, as a scalar for a Series or as a Series for a DataFrame (a DataFrame if level is specified).

Code:



Python3

# Calculate the Mean 
# of 'Age' column
mean = dataset['Age'].mean()
  
# Print mean
print(mean)


Output: 

29.69911764705882
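
The axis and skipna parameters described above are easier to see on a tiny frame than on the full dataset. The following is a minimal sketch on hypothetical toy data (the column names a and b are made up and do not come from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# Default axis (0): one mean per column, NaN values are skipped
col_means = df.mean()

# axis=1: one mean per row
row_means = df.mean(axis=1)

# skipna=False: a single NaN makes the whole result NaN
strict = df["a"].mean(skipna=False)

print(col_means)
print(row_means)
print(strict)
```

With skipna left at its default, the mean of column a is computed from the two non-null values only, which is also why the 'Age' mean above ignores passengers with a missing age.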

2. Median:

Calculates the median value by using DataFrame/Series.median() method.

Syntax: DataFrame/Series.median(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Parameters:

  • axis: {index (0), columns (1)}

          Specify the axis for the function to be applied on.

  • skipna:  This parameter takes bool value, default value is True

          It excludes null values when computing the result.

  • level: This parameter takes int or level name, default None

          If the axis is a MultiIndex, count along a particular level, collapsing into a Series.

  • numeric_only:  This parameter takes bool value, default value is None

          Include only float, int, boolean columns. If value is None, will attempt to use everything, then use only  numeric data.

  • **kwargs: Additional arguments to be passed to the function.

Returns: The median, as a scalar for a Series or as a Series for a DataFrame (a DataFrame if level is specified).

Code:

Python3

# Calculate Median of 'Fare' column
median = dataset['Fare'].median()
  
# Print median
print(median)


Output: 

14.4542
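
As a small illustration of skipna and numeric_only for median(), here is a sketch on hypothetical values (the Series and the name/fare columns are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([1.0, 3.0, np.nan, 10.0, 4.0])

# NaN is skipped; with four values left, the median is the
# average of the two middle values (3.0 and 4.0)
median_default = s.median()

# skipna=False propagates the missing value
median_strict = s.median(skipna=False)

# numeric_only=True restricts a mixed DataFrame to its numeric columns
mixed = pd.DataFrame({"name": ["x", "y", "z"], "fare": [7.0, 8.0, 30.0]})
numeric_medians = mixed.median(numeric_only=True)

print(median_default)
print(median_strict)
print(numeric_medians)
```

Passing numeric_only=True explicitly keeps the call independent of the default, which differs between pandas versions.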

3. Mode:

Calculates the mode, the most frequent value, using the DataFrame/Series.mode() method.

Syntax: DataFrame.mode(self, axis=0, numeric_only=False, dropna=True). Series.mode(self, dropna=True) accepts only the dropna parameter.

Parameters:

  • axis: {index (0), columns (1)}

          The axis to iterate over while searching for the mode value:

          0 value or ‘index’ : get mode of each column

          1 value or ‘columns’ : get mode of each row.

  • numeric_only:  This parameter takes bool value, default value is False.

           If True, only apply to numeric value columns.

  • dropna: This parameter takes bool value, default value is True.

           Don’t consider counts of NaN/None value.

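Returns: The most frequent value(s). The result is always a Series (or a DataFrame for DataFrame.mode()), because several values can tie for the highest frequency.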

Code:

Python3

# Calculate Mode of 'Sex' column
mode = dataset['Sex'].mode()
  
# Print mode
print(mode)


Output: 

0    male
dtype: object
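
The output above is a Series rather than a single string because mode() can return more than one value. A minimal sketch with hypothetical data shows a tie and the effect of dropna:

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation codes where "S" and "C" tie
ports = pd.Series(["S", "C", "S", "C", "Q"])
modes = ports.mode()          # both tied values are returned, sorted
first_mode = modes.iloc[0]    # pick one value when a scalar is needed

# Hypothetical numeric data where NaN is the most common entry
gaps = pd.Series([np.nan, np.nan, 1.0])
default_mode = gaps.mode()               # NaN is ignored by default
with_nan = gaps.mode(dropna=False)       # NaN is counted like any value

print(modes)
print(first_mode)
print(default_mode)
print(with_nan)
```

Indexing with .iloc[0] is one simple way to get a scalar, at the cost of silently picking the first of several tied values.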

4. Count:

Calculates the count or frequency of non-null values by using DataFrame/Series.count() method.

Syntax: DataFrame/Series.count(self, axis=0, level=None, numeric_only=False)

Parameters:

  • axis: {0 or ‘index’, 1 or ‘columns’}, default value is 0

          If value is 0 or ‘index’, counts are generated for each column. If value is 1 or ‘columns’, counts are generated for each row.

  • level: (optional) This parameter takes int or str value.

          If the axis is a MultiIndex type, count along a particular level, collapsing into a DataFrame. A str specifies the level name.

  • numeric_only: This parameter takes bool value, default False

          Include only float, int or boolean data.

Returns: For each column/row, the number of non-null entries. If level is specified, returns a DataFrame.

Code:

Python3

# Calculate Count of 'Ticket' column
count = dataset['Ticket'].count()
  
# Print count
print(count)


Output: 

891
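
Because count() only counts non-null entries, it can differ from the number of rows. The sketch below uses a hypothetical three-row frame (the values are invented) to contrast the two:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both columns
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": ["C85", None, None],
})

n_rows = len(df)                      # total rows, nulls included
per_column = df.count()               # non-null entries per column
per_row = df.count(axis=1)            # non-null entries per row
missing_age = df["Age"].isna().sum()  # number of nulls in one column

print(n_rows)
print(per_column)
print(per_row)
print(missing_age)
```

Comparing count() with len() is a quick way to spot columns that contain missing data.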

5. Standard Deviation:

Calculates the standard deviation of values by using DataFrame/Series.std() method.

Syntax: DataFrame/Series.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)

Parameters:

  • axis: {index (0), columns (1)}

          Specify the axis for the function to be applied on.

  • skipna: This parameter takes bool value, default value is True.

          Exclude NA/null values. If an entire row/column has NA values, the result will be NA value.

  • level: This parameter takes int or level name, default value is None.

          If the axis is a MultiIndex, count along a particular level, collapsing into a Series.

  • ddof: This parameter takes int value, default value is 1.

          Delta Degrees of Freedom. The divisor used in calculations is N – ddof, where N represents the number of elements.

  • numeric_only: This parameter takes bool value, default value is None.

          Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns: The standard deviation, as a scalar for a Series or as a Series for a DataFrame (a DataFrame if level is specified).

Code:

Python3

# Calculate Standard Deviation
# of 'Fare' column
std = dataset['Fare'].std()
  
# Print standard deviation
print(std)


Output:

49.693428597180905
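
The ddof parameter decides whether the sample or the population standard deviation is computed. The following sketch, on hypothetical values chosen so the arithmetic is easy to check, compares the two and shows that NumPy's default corresponds to ddof=0:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical values: mean is 5, squared deviations sum to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # ddof=1: divide by N - 1 = 7
population_std = values.std(ddof=0)  # ddof=0: divide by N = 8

# NumPy's std uses ddof=0 by default
numpy_std = np.std(values.to_numpy())

print(sample_std)
print(population_std)
print(numpy_std)
```

The different defaults are a common reason why pandas and NumPy report slightly different standard deviations for the same data.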

6. Max:

Calculates the maximum value using DataFrame/Series.max() method.

Syntax: DataFrame/Series.max(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Parameters:

  • axis: {index (0), columns (1)}

          Specify the axis for the function to be applied on.

  • skipna: bool, default True

          It excludes null values when computing the result.

  • level: int or level name, default None

          If the axis is a MultiIndex type, count along a particular level, collapsing into a Series.

  • numeric_only: bool, default None

           Include only float, int, boolean columns. If None value, will attempt to use everything, then use only  numeric data.

  • **kwargs: Additional keyword arguments to be passed to the function.

Returns: The maximum value, as a scalar for a Series or as a Series for a DataFrame (a DataFrame if level is specified).

Code:

Python3

# Calculate Maximum value in 'Age' column
maxValue = dataset['Age'].max()
  
# Print maxValue
print(maxValue)


Output:

80.0
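
A maximum is often more useful together with the row it belongs to. As a sketch on hypothetical passenger data (names and values are invented), max() can be applied per column or per row, and a boolean mask can retrieve the matching row:

```python
import pandas as pd

# Hypothetical passengers, not taken from the Titanic file
passengers = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, 80.0, 35.0],
    "Fare": [7.25, 30.0, 71.28],
})

# Maximum of a single column
oldest_age = passengers["Age"].max()

# Rows whose 'Age' equals that maximum
oldest_rows = passengers[passengers["Age"] == oldest_age]

# Maximum per column and per row of the numeric columns
per_column = passengers[["Age", "Fare"]].max()
per_row = passengers[["Age", "Fare"]].max(axis=1)

print(oldest_age)
print(oldest_rows)
print(per_column)
print(per_row)
```

The mask returns every row that ties for the maximum, so the result can contain more than one passenger.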

7. Min:

Calculates the minimum value using DataFrame/Series.min() method.

Syntax: DataFrame/Series.min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Parameters:

  • axis: {index (0), columns (1)}

          Specify the axis for the function to be applied on.

  • skipna: bool, default True

          It excludes null values when computing the result.

  • level: int or level name, default None

          If the axis is a MultiIndex type, count along a particular level, collapsing into a Series.

  • numeric_only: bool, default None

           Include only float, int, boolean columns. If None value, will attempt to use everything, then use only  numeric data.

  • **kwargs: Additional keyword arguments to be passed to the function.

Returns: The minimum value, as a scalar for a Series or as a Series for a DataFrame (a DataFrame if level is specified).

Code:

Python3

# Calculate Minimum value in 'Fare' column
minValue = dataset['Fare'].min()
  
# Print minValue
print(minValue)


Output: 

0.0
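
min() is not limited to numbers. The sketch below, using hypothetical data, shows that the minimum of a string column is the lexicographically smallest value, and that numeric_only=True restricts a mixed DataFrame to its numeric columns:

```python
import pandas as pd

# Hypothetical mixed-type frame
df = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Fare": [0.0, 7.25, 8.05],
})

# Strings are compared lexicographically
smallest_label = df["Sex"].min()

# Only numeric columns are considered
numeric_min = df.min(numeric_only=True)

print(smallest_label)
print(numeric_min)
```

Setting numeric_only=True explicitly avoids comparing text columns when only numeric minimums are of interest.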

8. Describe:

Summarizes general descriptive statistics using DataFrame/Series.describe() method.

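Syntax: DataFrame/Series.describe(self, percentiles=None, include=None, exclude=None)

Parameters:

  • percentiles: list-like of numbers between 0 and 1, optional

          The percentiles to include in the output. The default is [.25, .5, .75].

  • include: ‘all’, list-like of dtypes or None (default), optional

          The data types to include in the result. With None, only numeric columns are summarized. Ignored for Series.

  • exclude: list-like of dtypes or None (default), optional

          The data types to omit from the result. Ignored for Series.

Returns: Summary statistics of the Series or DataFrame provided.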

Python3

# Statistical summary
dataset.describe()


Output:

Titanic dataframe describe
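
The percentiles and include parameters change which rows and columns appear in the summary. Here is a minimal sketch on a hypothetical four-row frame (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# Default: only numeric columns, with 25%, 50% and 75% percentiles
default_summary = df.describe()

# Custom percentiles; the median (50%) is always included
custom_summary = df["Age"].describe(percentiles=[0.1, 0.9])

# include='all' also summarizes text columns (unique, top, freq)
full_summary = df.describe(include="all")

print(default_summary)
print(custom_summary)
print(full_summary)
```

In the include='all' output, statistics that do not apply to a column (for example the mean of a text column) are shown as NaN.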
