Prerequisites : NumPy in Python, Data Analysis Visualization with Python
Python is widely used for data analysis and visualization because of its extensive libraries, such as Pandas, NumPy, and Matplotlib. This article covers a few Pandas methods that help you understand your data better and draw useful insights from it.
1. Storing DataFrame in CSV Format :
Pandas provides the to_csv('filename', index=True|False) method to write a DataFrame to a CSV file. Here filename is the name of the CSV file that you want to create, and index controls whether the DataFrame's row index is written to the file as its first column. If we set index = False, the row index is not written. The default value of index is True, in which case the row index is written.
Example :
Python3
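import pandas as pd

# Three Series, each of which becomes one row of the DataFrame
s1 = pd.Series([0, 4, 8])
s2 = pd.Series([1, 5, 9])
s3 = pd.Series([2, 6, 10])
dframe = pd.DataFrame([s1, s2, s3])
dframe.columns = ['Geeks', 'For', 'Geeks']

# index=False: the row index is left out of the file
dframe.to_csv('geeksforgeeks.csv', index=False)
# index=True (the default): the row index is written as the first column
dframe.to_csv('geeksforgeeks1.csv', index=True)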
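Output (each file read back with pd.read_csv, which displays the duplicate column name as Geeks.1 and the unnamed index column as Unnamed: 0) :
geeksforgeeks.csv:
   Geeks  For  Geeks.1
0      0    4        8
1      1    5        9
2      2    6       10
geeksforgeeks1.csv:
   Unnamed: 0  Geeks  For  Geeks.1
0           0      0    4        8
1           1      1    5        9
2           2      2    6       10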
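The effect of the index argument can also be checked without touching the disk. The following is a minimal sketch, assuming a small hypothetical DataFrame with string row labels (the column names a and b and the labels x, y, z are illustrative and not part of the example above). It relies on to_csv returning the CSV text when no path is given, and on the index_col argument of read_csv to restore the saved index.

```python
import io

import pandas as pd

# Hypothetical sample data; names and labels are illustrative only.
dframe = pd.DataFrame({"a": [0, 1, 2], "b": [4, 5, 6]}, index=["x", "y", "z"])

# With no path argument, to_csv returns the CSV text as a string.
csv_with_index = dframe.to_csv(index=True)
csv_without_index = dframe.to_csv(index=False)

# index_col=0 tells read_csv to use the first column as the row index again.
restored = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
# Without a saved index, read_csv falls back to a default 0..n-1 index.
reindexed = pd.read_csv(io.StringIO(csv_without_index))

print(restored)
print(reindexed)
```

If the row labels carry meaning (dates, IDs), writing them and reading them back with index_col is usually the safer choice; if they are just the default 0..n-1 counter, index=False avoids the extra Unnamed: 0 column.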
2. Handling Missing Data
Data analysis also involves handling missing values in a dataset, and Pandas provides the dropna and fillna methods for this. When dealing with missing data, you can either drop the rows or columns that contain NaN values (dropna method) or fill in the missing entries with a value such as the mean or mode of the column (fillna method). This decision is significant: it depends on the data and on the effect that dropping or filling would have on the results.
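Before choosing between dropping and filling, it often helps to see how much data is actually missing. The sketch below is one possible way to do that, using the same small DataFrame as the examples that follow; the variable names missing_per_column and rows_with_missing are illustrative.

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame({"Geeks": [23, 24, 22],
                       "For": [10, 12, np.nan],
                       "geeks": [0, np.nan, np.nan]})

# Number of NaN values in each column.
missing_per_column = dframe.isna().sum()
# Fraction of rows that contain at least one NaN.
rows_with_missing = dframe.isna().any(axis=1).mean()

print(missing_per_column)
print(rows_with_missing)
```

A column that is mostly NaN is often a candidate for dropping, while a column with only a few gaps may be better filled.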
Drop the missing data: Let's create a DataFrame with null values and drop every row that contains one:
Python3
import numpy as np
import pandas as pd

dframe = pd.DataFrame({'Geeks': [23, 24, 22],
                       'For': [10, 12, np.nan],
                       'geeks': [0, np.nan, np.nan]},
                      columns=['Geeks', 'For', 'geeks'])
print("DataFrame:")
print(dframe)

# axis=0 (the default) drops every row that contains a NaN
dframe.dropna(inplace=True)
print("Dropping Null axis = 0")
print(dframe)
Output :
DataFrame:
   Geeks   For  geeks
0     23  10.0    0.0
1     24  12.0    NaN
2     22   NaN    NaN
Dropping Null axis = 0
   Geeks   For  geeks
0     23  10.0    0.0
Dropping columns:
Python3
import numpy as np
import pandas as pd

dframe = pd.DataFrame({'Geeks': [23, 24, 22],
                       'For': [10, 12, np.nan],
                       'geeks': [0, np.nan, np.nan]},
                      columns=['Geeks', 'For', 'geeks'])
# axis=1 drops every column that contains a NaN
dframe.dropna(axis=1, inplace=True)
print(dframe)
Output:
   Geeks
0     23
1     24
2     22
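Dropping every row or column that contains any NaN can discard a lot of data. dropna also accepts the subset, thresh, and how parameters for finer control; the sketch below illustrates them on the same DataFrame. The variable names by_subset, by_thresh, and all_missing are illustrative.

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame({"Geeks": [23, 24, 22],
                       "For": [10, 12, np.nan],
                       "geeks": [0, np.nan, np.nan]})

# Drop a row only if its 'For' value is missing.
by_subset = dframe.dropna(subset=["For"])
# Keep only rows that have at least 2 non-missing values.
by_thresh = dframe.dropna(thresh=2)
# Drop a column only if every value in it is missing (none qualify here).
all_missing = dframe.dropna(axis=1, how="all")

print(by_subset)
print(by_thresh)
print(all_missing)
```

None of these calls uses inplace=True, so each returns a new DataFrame and leaves dframe unchanged.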
Fill the missing values : To replace NaN values with the mean or mode of the data, use fillna. It can fill the NaN values of a particular column or of the whole DataFrame, as required.
Python3
import numpy as np
import pandas as pd

dframe = pd.DataFrame({'Geeks': [23, 24, 22],
                       'For': [10, 12, np.nan],
                       'geeks': [0, np.nan, np.nan]},
                      columns=['Geeks', 'For', 'geeks'])
# dframe.mean() holds one mean per column; each NaN gets its own column's mean
dframe.fillna(value=dframe.mean(), inplace=True)
print(dframe)
Output :
   Geeks   For  geeks
0     23  10.0    0.0
1     24  12.0    0.0
2     22  11.0    0.0
Filling value of one column:
Python3
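import numpy as np
import pandas as pd

dframe = pd.DataFrame({'Geeks': [23, 24, 22],
                       'For': [10, 12, np.nan],
                       'geeks': [0, np.nan, np.nan]},
                      columns=['Geeks', 'For', 'geeks'])
# Assign the result back to the column. Calling fillna(inplace=True) on a
# selected column is chained assignment, which recent pandas versions warn about.
dframe['For'] = dframe['For'].fillna(dframe['For'].mean())
print(dframe)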
Output:
   Geeks   For  geeks
0     23  10.0    0.0
1     24  12.0    NaN
2     22  11.0    NaN
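Different columns can also be filled with different statistics in a single call, because fillna accepts a dict that maps column names to fill values. The following is a minimal sketch that fills For with its mean and geeks with its mode; the names fill_values and filled are illustrative.

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame({"Geeks": [23, 24, 22],
                       "For": [10, 12, np.nan],
                       "geeks": [0, np.nan, np.nan]})

# One fill value per column; columns not listed are left untouched.
fill_values = {
    "For": dframe["For"].mean(),
    # mode() returns a Series (there can be ties), so take its first entry.
    "geeks": dframe["geeks"].mode().iloc[0],
}
filled = dframe.fillna(value=fill_values)

print(filled)
```

The mode is usually the more sensible choice for categorical or discrete columns, while the mean suits continuous numeric ones.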
3. Groupby Method (Aggregation) :
The groupby method groups rows that share the same value in one or more columns (or according to a mapping or function applied to the index), so that aggregate functions can then be applied to each group to analyze the data. Consider the DataFrame generated by the code below :
Python3
import pandas as pd

dframe = pd.DataFrame({'Geeks': [23, 24, 22, 22, 23, 24],
                       'For': [10, 12, 13, 14, 15, 16],
                       'geeks': [122, 142, 112, 122, 114, 112]},
                      columns=['Geeks', 'For', 'geeks'])
print(dframe)

# For each distinct value of 'Geeks', take the maximum of the other columns
print("After groupby: ")
print(dframe.groupby(['Geeks']).max())
Output :
   Geeks  For  geeks
0     23   10    122
1     24   12    142
2     22   13    112
3     22   14    122
4     23   15    114
5     24   16    112
After groupby:
       For  geeks
Geeks
22      14    122
23      15    122
24      16    142
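A single groupby can also compute several aggregates at once through agg. The sketch below uses named aggregation on the same data; the output column names for_mean, geeks_min, and geeks_max are chosen here for illustration.

```python
import pandas as pd

dframe = pd.DataFrame({"Geeks": [23, 24, 22, 22, 23, 24],
                       "For": [10, 12, 13, 14, 15, 16],
                       "geeks": [122, 142, 112, 122, 114, 112]})

# Named aggregation: output_column=(input_column, aggregation_function)
grouped = dframe.groupby("Geeks").agg(
    for_mean=("For", "mean"),
    geeks_min=("geeks", "min"),
    geeks_max=("geeks", "max"),
)

print(grouped)
```

The result is indexed by the distinct values of Geeks, with one column per requested aggregate.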
Last Updated :
09 Sep, 2023