Data Processing with Pandas

Last Updated : 27 Dec, 2023

Data Processing is an important part of any task that includes data-driven work. It helps us to provide meaningful insights from the data. As we know Python is a widely used programming language, and there are various libraries and tools available for data processing.

In this article, we are going to see Data Processing in Python, Loading, Printing rows and Columns, Data frame summary, Missing data values Sorting and Merging Data Frames, Applying Functions, and Visualizing Dataframes.

Table of Content

What is Data Processing in Python?
What is Pandas?
Loading Data in Pandas DataFrame
Printing rows of the Data
Printing the column names of the DataFrame
Summary of Data Frame
Descriptive Statistical Measures of a DataFrame
Missing Data Handing
Sorting DataFrame values
Merge Data Frames
Apply Function
By using the lambda operator
Visualizing DataFrame
Conclusion

What is Data Processing in Python?

Data processing in Python refers to manipulating, transforming, and analyzing data by using Python. It contains a series of operations that aim to change raw data into structured data. or meaningful insights. By converting raw data into meaningful insights it makes it suitable for analysis, visualization, or other applications.Python provides several libraries and tools that facilitate efficient data processing, making it a popular choice for working with diverse datasets.

Refer to this article – Introduction to Data Processing

What is Pandas?

Pandas is a powerful, fast, and open-source library built on NumPy. It is used for data manipulation and real-world data analysis in Python. Easy handling of missing data, Flexible reshaping and pivoting of data sets, and size mutability make pandas a great tool for performing data manipulation and handling the data efficiently.

Loading Data in Pandas DataFrame

Reading CSV file using pd.read_csv and loading data into a data frame. Import pandas as using pd for the shorthand. You can download the data from here.

Python3

#Importing pandas library
import pandas as pd
 
#Loading data into a DataFrame
data_frame=pd.read_csv('Mall_Customers.csv')

Printing rows of the Data

By default, data_frame.head() displays the first five rows and data_frame.tail() displays last five rows. If we want to get first ‘n’ number of rows then we use, data_frame.head(n) similar is the syntax to print the last n rows of the data frame.

Python3

#displaying first five rows
display(data_frame.head())
 
#displaying last five rows
display(data_frame.tail())

Output:

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
     CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
195         196  Female   35                 120                      79
196         197  Female   45                 126                      28
197         198    Male   32                 126                      74
198         199    Male   32                 137                      18
199         200    Male   30                 137                      83
[4]
0s
# Program to print all the column name of the dataframe
print(list(data_frame.columns))

Printing the column names of the DataFrame

Python3

# Program to print all the column name of the dataframe
print(list(data_frame.columns))

Output:

['CustomerID', 'Genre', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']

Summary of Data Frame

The functions info() prints the summary of a DataFrame that includes the data type of each column, RangeIndex (number of rows), columns, non-null values, and memory usage.

Python3

data_frame.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Genre                   200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB

Descriptive Statistical Measures of a DataFrame

The describe() function outputs descriptive statistics which include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. For numeric data, the result’s index will include count, mean, std, min, and max as well as lower, 50, and upper percentiles. For object data (e.g. strings), the result’s index will include count, unique, top, and freq.

Python3

data_frame.describe()

Output:

       CustomerID         Age  Annual Income (k$)  Spending Score (1-100)
count  200.000000  200.000000          200.000000              200.000000
mean   100.500000   38.850000           60.560000               50.200000
std     57.879185   13.969007           26.264721               25.823522
min      1.000000   18.000000           15.000000                1.000000
25%     50.750000   28.750000           41.500000               34.750000
50%    100.500000   36.000000           61.500000               50.000000
75%    150.250000   49.000000           78.000000               73.000000
max    200.000000   70.000000          137.000000               99.000000

Missing Data Handing

Find missing values in the dataset

The isnull( ) detects the missing values and returns a boolean object indicating if the values are NA. The values which are none or empty get mapped to true values and not null values get mapped to false values.

Python3

data_frame.isnull( )

Output:

     CustomerID  Genre    Age  Annual Income (k$)  Spending Score (1-100)
0         False  False  False               False                   False
1         False  False  False               False                   False
2         False  False  False               False                   False
3         False  False  False               False                   False
4         False  False  False               False                   False
..          ...    ...    ...                 ...                     ...
195       False  False  False               False                   False
196       False  False  False               False                   False
197       False  False  False               False                   False
198       False  False  False               False                   False
199       False  False  False               False                   False
[200 rows x 5 columns]
[8]
0s

Find the number of missing values in the dataset

To find out the number of missing values in the dataset, use data_frame.isnull( ).sum( ). In the below example, the dataset doesn’t contain any null values. Hence, each column’s output is 0.

Python3

data_frame.isnull().sum()

Output:

CustomerID                0
Genre                     0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

Removing missing values

The data_frame.dropna( ) function removes columns or rows which contains atleast one missing values.

data_frame = data_frame.dropna()

By default, data_frame.dropna( ) drops the rows where at least one element is missing. data_frame.dropna(axis = 1) drops the columns where at least one element is missing.

Fill in missing values

We can fill null values using data_frame.fillna( ) function.

 data_frame = data_frame.fillna(value)

But by using the above format all the null values will get filled with the same values. To fill different values in the different columns we can use.

data_frame[col] = data_frame[col].fillna(value)

Row and column manipulations

Removing rows

By using the drop(index) function we can drop the row at a particular index. If we want to replace the data_frame with the row removed then add inplace = True in the drop function.

Python3

#Removing 4th indexed value from the dataframe
data_frame.drop(4).head()

Output:

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
5           6  Female   22                  17                      76
[ ]

This function can also be used to remove the columns of a data frame by adding the attribute axis =1 and providing the list of columns we would like to remove.

Renaming rows

The rename function can be used to rename the rows or columns of the data frame.

Python3

data_frame.rename({0:"First",1:"Second"})

Output:

        CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
First            1    Male   19                  15                      39
Second           2    Male   21                  15                      81
2                3  Female   20                  16                       6
3                4  Female   23                  16                      77
4                5  Female   31                  17                      40
...            ...     ...  ...                 ...                     ...
195            196  Female   35                 120                      79
196            197  Female   45                 126                      28
197            198    Male   32                 126                      74
198            199    Male   32                 137                      18
199            200    Male   30                 137                      83
[200 rows x 5 columns]

Adding new columns

Python3

#Creates a new column with all the values equal to 1
data_frame['NewColumn'] = 1
data_frame.head()

Output:

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  \
0           1    Male   19                  15                      39   
1           2    Male   21                  15                      81   
2           3  Female   20                  16                       6   
3           4  Female   23                  16                      77   
4           5  Female   31                  17                      40   
   NewColumn  
0          1  
1          1  
2          1  
3          1  
4          1

Sorting DataFrame values

Sort by column

The sort_values( ) are the values of the column whose name is passed in the by attribute in the ascending order by default we can set this attribute to false to sort the array in the descending order.

Python3

data_frame.sort_values(by='Age', ascending=False).head()

Output:

    CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  \
70          71    Male   70                  49                      55   
60          61    Male   70                  46                      56   
57          58    Male   69                  44                      46   
90          91  Female   68                  59                      55   
67          68  Female   68                  48                      48   
    NewColumn  
70          1  
60          1  
57          1  
90          1  
67          1

Sort by multiple columns

Python3

data_frame.sort_values(by=['Age','Annual Income (k$)']).head(10)

Output:

     CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  \
33           34    Male   18                  33                      92   
65           66    Male   18                  48                      59   
91           92    Male   18                  59                      41   
114         115  Female   18                  65                      48   
0             1    Male   19                  15                      39   
61           62    Male   19                  46                      55   
68           69    Male   19                  48                      59   
111         112  Female   19                  63                      54   
113         114    Male   19                  64                      46   
115         116  Female   19                  65                      50   
     NewColumn  
33           1  
65           1  
91           1  
114          1  
0            1  
61           1  
68           1  
111          1  
113          1  
115          1

Merge Data Frames

The merge() function in pandas is used for all standard database join operations. Merge operation on data frames will join two data frames based on their common column values. Let’s create a data frame.

Python3

#Creating dataframe1
df1 = pd.DataFrame({
        'Name':['Jeevan', 'Raavan', 'Geeta', 'Bheem'], 
        'Age':[25, 24, 52, 40], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']})
df1

Output:

     Name  Age Qualification
0  Jeevan   25           Msc
1  Raavan   24            MA
2   Geeta   52           MCA
3   Bheem   40           Phd

Now we will create another data frame.

Python3

#Creating dataframe2
df2 = pd.DataFrame({'Name':['Jeevan', 'Raavan', 'Geeta', 'Bheem'],
                    'Salary':[100000, 50000, 20000, 40000]})
df2

Output:

     Name  Salary
0  Jeevan  100000
1  Raavan   50000
2   Geeta   20000
3   Bheem   40000

Now. let’s merge these two data frames created earlier.

Python3

#Merging two dataframes
df = pd.merge(df1, df2)
df

Output:

     Name  Age Qualification  Salary
0  Jeevan   25           Msc  100000
1  Raavan   24            MA   50000
2   Geeta   52           MCA   20000
3   Bheem   40           Phd   40000

Apply Function

By defining a function beforehand

The apply( ) function is used to iterate over a data frame. It can also be used with lambda functions.

Python3

# Apply function
def fun(value):
    if value > 70:
        return "Yes"
    else:
        return "No"
 
data_frame['Customer Satisfaction'] = data_frame['Spending Score (1-100)'].apply(fun)
data_frame.head(10)

Output:

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  \
0           1    Male   19                  15                      39   
1           2    Male   21                  15                      81   
2           3  Female   20                  16                       6   
3           4  Female   23                  16                      77   
4           5  Female   31                  17                      40   
5           6  Female   22                  17                      76   
6           7  Female   35                  18                       6   
7           8  Female   23                  18                      94   
8           9    Male   64                  19                       3   
9          10  Female   30                  19                      72   
   NewColumn Customer Satisfaction  
0          1                    No  
1          1                   Yes  
2          1                    No  
3          1                   Yes  
4          1                    No  
5          1                   Yes  
6          1                    No  
7          1                   Yes  
8          1                    No  
9          1                   Yes

By using the lambda operator

This syntax is generally used to apply log transformations and normalize the data to bring it in the range of 0 to 1 for particular columns of the data.

Python3

const = data_frame['Age'].max()
data_frame['Age'] = data_frame['Age'].apply(lambda x: x/const)
data_frame.head()

Output:

   CustomerID   Genre       Age  Annual Income (k$)  Spending Score (1-100)  \
0           1    Male  0.271429                  15                      39   
1           2    Male  0.300000                  15                      81   
2           3  Female  0.285714                  16                       6   
3           4  Female  0.328571                  16                      77   
4           5  Female  0.442857                  17                      40   
   NewColumn Customer Satisfaction  
0          1                    No  
1          1                   Yes  
2          1                    No  
3          1                   Yes  
4          1                    No

Visualizing DataFrame

Scatter plot

The plot( ) function is used to make plots of the data frames.

Python3

# Visualization
data_frame.plot(x ='CustomerID', y='Spending Score (1-100)',kind = 'scatter')

Output:

Scatter plot of the Customer Satisfaction column

Histogram

The plot.hist( ) function is used to make plots of the data frames.

Python3

data_frame.plot.hist()

Output:

Histogram for the distribution of the data

Conclusion

There are other functions as well of pandas data frame but the above mentioned are some of the common ones generally used for handling large tabular data. One can refer to the pandas documentation as well to explore more about the functions mentioned above.

Suggest improvement

Processing time with Pandas DataFrame

Share your thoughts in the comments

Data Processing with Pandas

What is Data Processing in Python?

What is Pandas?

Loading Data in Pandas DataFrame

Python3

Printing rows of the Data

Python3

Printing the column names of the DataFrame

Python3

Summary of Data Frame

Python3

Descriptive Statistical Measures of a DataFrame

Python3

Missing Data Handing

Find missing values in the dataset

Python3

Find the number of missing values in the dataset

Python3

Removing missing values

Fill in missing values

Row and column manipulations

Removing rows

Python3

Renaming rows

Python3

Adding new columns

Python3

Sorting DataFrame values

Sort by column

Python3

Sort by multiple columns

Python3

Merge Data Frames

Python3

Python3

Python3

Apply Function

By defining a function beforehand

Python3

By using the lambda operator

Python3

Visualizing DataFrame

Scatter plot

Python3

Histogram

Python3

Conclusion

Please Login to comment...

Similar Reads

What kind of Experience do you want to share?