Pandas Cut – Continuous to Categorical
Continuous numerical data, often highly skewed, is common in data analysis. Some analyses become easier once the continuous values are converted to discrete categories. There are several ways to do this conversion; one is the pandas cut function, pandas.cut, which sorts continuous values into bins and returns categorical data. A call has three main parts:
- The first is the input, which must be a 1-dimensional array-like object such as a list, a NumPy array, or a Series (for example, a single DataFrame column).
- The second is bins, the edges that divide the continuous range into intervals. Each pair of consecutive edges defines one bin: the first number is the start point of the bin and the following number is its endpoint. Passing explicit edges gives full control over where each bin begins and ends.
- The third is labels. When labels are given, there must be exactly one label per bin, so the number of labels is always one fewer than the number of bin edges.
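The relationship between bin edges and labels can be sketched as follows. This is a minimal illustration, assuming a small set of hypothetical sample ages; the variable names are made up for the example.

```python
import pandas as pd

ages = pd.Series([2, 10, 40, 70])   # hypothetical sample values
edges = [0, 3, 17, 63, 99]          # 5 edges define 4 bins

# Without labels, each value is tagged with the interval it falls into.
default_binned = pd.cut(ages, bins=edges)
print(default_binned.cat.categories)

# With labels, one label per bin: len(names) == len(edges) - 1.
names = ["Baby/Toddler", "Child", "Adult", "Elderly"]
labelled = pd.cut(ages, bins=edges, labels=names)
print(labelled.tolist())

# A label count that does not match the number of bins is rejected.
mismatch_rejected = False
try:
    pd.cut(ages, bins=edges, labels=["Low", "High"])
except ValueError as err:
    mismatch_rejected = True
    print("ValueError:", err)
```

Five edges produce four bins, so four labels are accepted while two labels raise a ValueError.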
Note: For any NA values, the result will be stored as NA. Out of bounds values will also be NA in the resultant categorical bins.
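A short sketch of this behaviour, assuming hypothetical sample values: one entry is missing, one lies above the last edge, and one sits exactly on the lowest edge, which the default right-closed bins exclude.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 150 is above the last edge, 0 equals the first edge.
values = pd.Series([5, np.nan, 150, 0, 50])

# Edges 0, 50, 100 give the bins (0, 50] and (50, 100].
binned = pd.cut(values, bins=[0, 50, 100])
print(binned)

# True wherever the value could not be placed in a bin.
unbinned = binned.isna()
print(unbinned.tolist())
```

Only 5 and 50 land in a bin here; the missing value, the out-of-range 150, and the 0 on the open left edge all come back as NA.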
The cut function does not guarantee any particular distribution of values across bins. Depending on how the edges are chosen, a bin may end up containing no values at all.
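As an illustration of an empty bin, the sketch below uses hypothetical scores clustered at both ends of the range; the labels Q1 to Q4 are made up for the example.

```python
import pandas as pd

# Hypothetical scores: nothing falls between 25 and 75.
scores = pd.Series([12, 15, 18, 95, 97])

binned = pd.cut(scores, bins=[0, 25, 50, 75, 100],
                labels=["Q1", "Q2", "Q3", "Q4"])

# For categorical data, value_counts also reports categories with no members.
counts = binned.value_counts(sort=False)
print(counts)
```

The two middle bins are reported with a count of 0, which is a quick way to spot edges that do not suit the data.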
Syntax: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
Parameters:
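- x: Input array to be binned. Must be 1-dimensional.
- bins: Defines the bin boundaries for segmentation. Either a sequence of bin edges or an integer number of equal-width bins.
- right: Boolean indicating whether each bin includes its right edge. Default value is True, so bins [1, 2, 3] give the intervals (1, 2] and (2, 3].
- labels: Defines labels for the returned bins. Either an array of labels, or False to return only integer bin indicators.
Return Value: Returns a Categorical, a Series, or a NumPy array, depending on the input type and on labels. If retbins=True, the bin edges are returned as well.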
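The effect of these parameters can be sketched on a small, hypothetical set of heights. The values and edges are chosen only so that some of them fall exactly on a bin boundary.

```python
import pandas as pd

heights = pd.Series([150, 157, 163, 170])  # hypothetical values
edges = [150, 157, 170, 180]

# right=True (default): bins are (150, 157], (157, 170], (170, 180].
# 150 sits on the open left edge, so it is not placed in any bin.
right_closed = pd.cut(heights, bins=edges)

# right=False: bins are [150, 157), [157, 170), [170, 180).
left_closed = pd.cut(heights, bins=edges, right=False)

# labels=False returns the integer position of each bin instead of a label.
codes = pd.cut(heights, bins=edges, right=False, labels=False)
print(codes.tolist())  # [0, 1, 1, 2]

# An integer for bins asks for that many equal-width bins;
# retbins=True also returns the edges that were computed.
binned, used_edges = pd.cut(heights, bins=3, retbins=True)
print(used_edges)
```

Comparing right_closed with left_closed shows how a value lying exactly on an edge, such as 157 or 170, moves to the neighbouring bin when right is switched.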
Example 1: Let’s say we have a column ‘Age’ of 14 numbers between 1 and 100 and we wish to separate the data into 4 categories:
'Baby/Toddler' :- 0 to 3 years
'Child' :- 4 to 17 years
'Adult' :- 18 to 63 years
'Elderly' :- 64 to 99 years
Python3
import pandas as pd

# Sample ages to be binned
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89, 4,
                           10, 15, 38, 22, 77]})
print("Before: ")
print(df)

# Edges 0, 3, 17, 63, 99 define the bins (0, 3], (3, 17], (17, 63], (63, 99]
df['Label'] = pd.cut(x=df['Age'], bins=[0, 3, 17, 63, 99],
                     labels=['Baby/Toddler', 'Child', 'Adult',
                             'Elderly'])
print("After: ")
print(df)

# Count how many ages fall into each category
print("Categories: ")
print(df['Label'].value_counts())
Output:
Before:
Age
0 42
1 15
2 67
3 55
4 1
5 29
6 75
7 89
8 4
9 10
10 15
11 38
12 22
13 77
After:
Age Label
0 42 Adult
1 15 Child
2 67 Elderly
3 55 Adult
4 1 Baby/Toddler
5 29 Adult
6 75 Elderly
7 89 Elderly
8 4 Child
9 10 Child
10 15 Child
11 38 Adult
12 22 Adult
13 77 Elderly
Categories:
Adult 5
Elderly 4
Child 4
Baby/Toddler 1
Name: Label, dtype: int64
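One detail worth checking with these edges: because the default bins exclude their left edge, an age of exactly 0 would not fall into 'Baby/Toddler'. The sketch below, using a few hypothetical ages, compares the default behaviour with include_lowest=True, which closes the first bin on the left.

```python
import pandas as pd

ages = pd.Series([0, 1, 3, 4])  # hypothetical values around the first bin
edges = [0, 3, 17, 63, 99]
names = ["Baby/Toddler", "Child", "Adult", "Elderly"]

# Default: the first bin is (0, 3], so 0 is left unbinned (NA).
default = pd.cut(ages, bins=edges, labels=names)
print(default.tolist())

# include_lowest=True makes the first bin include its left edge as well.
inclusive = pd.cut(ages, bins=edges, labels=names, include_lowest=True)
print(inclusive.tolist())
```

With include_lowest=True the age 0 is labelled 'Baby/Toddler'; the other values are unaffected.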
Example 2: Let’s say we have a column ‘Height’ holding the heights of 12 people, ranging from 150 cm to 180 cm, and we wish to separate the data into 3 categories:
'Short' :- greater than 150 cm up to 157 cm
'Average' :- greater than 157 cm up to 169 cm
'Tall' :- greater than 169 cm up to 180 cm
Python3
import pandas as pd

# Sample heights in centimetres
df = pd.DataFrame({'Height': [150.4, 157.6, 170, 176, 164.2, 155,
                              159.2, 175, 162.4, 176, 153, 170.9]})
print("Before: ")
print(df)

# Edges 150, 157, 169, 180 define the bins (150, 157], (157, 169], (169, 180]
df['Label'] = pd.cut(x=df['Height'],
                     bins=[150, 157, 169, 180],
                     labels=['Short', 'Average', 'Tall'])
print("After: ")
print(df)

# Count how many heights fall into each category
print("Categories: ")
print(df['Label'].value_counts())
Output:
Before:
Height
0 150.4
1 157.6
2 170.0
3 176.0
4 164.2
5 155.0
6 159.2
7 175.0
8 162.4
9 176.0
10 153.0
11 170.9
After:
Height Label
0 150.4 Short
1 157.6 Average
2 170.0 Tall
3 176.0 Tall
4 164.2 Average
5 155.0 Short
6 159.2 Average
7 175.0 Tall
8 162.4 Average
9 176.0 Tall
10 153.0 Short
11 170.9 Tall
Categories:
Tall 5
Average 4
Short 3
Name: Label, dtype: int64
Last Updated: 28 Nov, 2021