DataFrame vs Series in Pandas
Last Updated: 17 Feb, 2024
Pandas is a widely used Python library for data analysis that provides two essential data structures: Series and DataFrame. Both are used to store and examine labeled data, but they differ in dimensionality, features, and typical applications.
In this article, we will explore the differences between Series and DataFrames.
What is Pandas?
Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use data structures like DataFrame and Series, which are designed to make working with structured data fast, easy, and expressive. Pandas is widely used in data science, machine learning, and data analysis for tasks such as data cleaning, transformation, and exploration.
What is a Pandas Series?
A Pandas Series is a one-dimensional, array-like object that can hold data of any type (integer, float, string, etc.). It is labeled: each element is associated with a label, and the labels together form the index. Labels are usually unique, although Pandas does not require them to be. You can think of a Series as a column in a spreadsheet or a single column of a database table. Series are a fundamental data structure in Pandas and are commonly used for data manipulation and analysis tasks. They can be created from lists, arrays, dictionaries, and existing Series objects. Series are also the building block for the more complex Pandas DataFrame, a two-dimensional, table-like structure in which each column is a Series.
Creating a Series data structure from a list, dictionary, and custom index:
Python3
import pandas as pd

# Series from a list (default integer index 0..4)
data = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data)
print(series_from_list)

# Series from a dictionary (keys become the index labels)
data = {'a': 1, 'b': 2, 'c': 3}
series_from_dict = pd.Series(data)
print(series_from_dict)

# Series from a list with a custom index
data = [1, 2, 3, 4, 5]
index = ['a', 'b', 'c', 'd', 'e']
series_custom_index = pd.Series(data, index=index)
print(series_custom_index)
Output:
0 1
1 2
2 3
3 4
4 5
dtype: int64
a 1
b 2
c 3
dtype: int64
a 1
b 2
c 3
d 4
e 5
dtype: int64
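The description above also lists arrays and existing Series objects as possible sources. The following is a minimal sketch of both cases; the variable names and values are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Series from a NumPy array: the dtype is taken from the array (float64 here).
array_data = np.array([1.5, 2.5, 3.5])
series_from_array = pd.Series(array_data, index=["x", "y", "z"])
print(series_from_array)

# Series from an existing Series: copy=True requests an independent copy,
# so changing the copy does not affect the original.
series_copy = pd.Series(series_from_array, copy=True)
series_copy["x"] = 99.0
print(series_from_array["x"])
print(series_copy["x"])
```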
Key Features of Series data structure:
Indexing:
Each element in a Series has a corresponding index, which can be used to access or manipulate the data.
Python3
print(series_from_list[0])    # access by integer label in the default index
print(series_from_dict['b'])  # access by string label
Output:
1
2
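Square brackets can be ambiguous when the index itself contains integers, so it may help to be explicit about whether a lookup is by label or by position. The sketch below uses the `.loc` and `.iloc` accessors on a small Series; the data is an illustrative assumption.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s.loc["b"]         # label-based lookup
by_position = s.iloc[1]       # position-based lookup (second element)

label_slice = s.loc["a":"b"]  # label slices include the end label
position_slice = s.iloc[0:1]  # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Note that the label slice returns two elements while the position slice returns one, because only label slices are end-inclusive.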
Vectorized Operations:
Series supports vectorized operations, allowing you to perform arithmetic operations on the entire series efficiently.
Python3
series_a = pd.Series([1, 2, 3])
series_b = pd.Series([4, 5, 6])
# Element-wise addition, no explicit loop required
sum_series = series_a + series_b
print(sum_series)
Output:
0 5
1 7
2 9
dtype: int64
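Vectorized operations are not limited to adding two Series. As a rough sketch, a scalar can be applied to every element, and a comparison produces a boolean Series that can be used as a filter. The `prices` data here is a made-up example.

```python
import pandas as pd

prices = pd.Series([100.0, 250.0, 40.0])

# A scalar is applied to every element.
discounted = prices * 0.9

# A comparison returns a boolean Series of the same length.
is_expensive = prices > 90

# A boolean Series can be used to select matching elements.
expensive_only = prices[is_expensive]

print(discounted)
print(is_expensive)
print(expensive_only)
```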
Alignment:
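When performing operations between two Series objects, Pandas automatically aligns the data based on the index labels rather than on position. A label that appears in only one of the two Series has no partner value, so the result for that label is NaN.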
Python3
series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Only 'b' and 'c' exist in both Series; 'a' and 'd' become NaN
sum_series = series_a + series_b
print(sum_series)
Output:
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
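If NaN for unmatched labels is not the desired behavior, one option is the `Series.add` method with its `fill_value` parameter, which substitutes a value for a label missing from one side before adding. A minimal sketch, reusing the same two Series:

```python
import pandas as pd

series_a = pd.Series([1, 2, 3], index=["a", "b", "c"])
series_b = pd.Series([4, 5, 6], index=["b", "c", "d"])

# Labels missing from one Series are treated as 0 instead of producing NaN.
filled_sum = series_a.add(series_b, fill_value=0)
print(filled_sum)
```

Here 'a' and 'd' keep the value from the one Series that contains them, while 'b' and 'c' are summed as before.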
NaN Handling:
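Missing values are represented by NaN (Not a Number). Series operations do not raise an error when a value is missing; instead the NaN propagates into the result, and an integer Series is converted to float64 so that it can hold NaN.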
Python3
series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5], index=['b', 'c'])
# 'a' has no counterpart in series_b, so its result is NaN
sum_series = series_a + series_b
print(sum_series)
Output:
a NaN
b 6.0
c 8.0
dtype: float64
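Once a Series contains NaN, a few common methods help to detect, ignore, fill, or drop the missing entries. The following sketch uses illustrative data and only standard Series methods (`isna`, `sum`, `fillna`, `dropna`).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

missing_mask = s.isna()  # True where a value is missing
total = s.sum()          # reductions skip NaN by default
filled = s.fillna(0.0)   # replace NaN with a chosen value
dropped = s.dropna()     # remove entries that are NaN

print(missing_mask)
print(total)
print(filled)
print(dropped)
```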
What is a Pandas DataFrame?
A Pandas DataFrame is a two-dimensional, tabular data structure with rows and columns. It is similar to a spreadsheet or a table in a relational database. A DataFrame has three main components: the data (the values themselves), the row labels (the index), and the column labels.
Creating a DataFrame from a dictionary and from a list of lists:
Python3
import pandas as pd

# DataFrame from a dictionary of lists (keys become column names)
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)

# DataFrame from a list of rows, with column names given separately
data = [['John', 25, 'New York'],
        ['Alice', 30, 'Los Angeles'],
        ['Bob', 35, 'Chicago']]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
Name Age City
0 John 25 New York
1 Alice 30 Los Angeles
2 Bob 35 Chicago
Name Age City
0 John 25 New York
1 Alice 30 Los Angeles
2 Bob 35 Chicago
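The three components described earlier (values, row labels, and column labels) can be inspected directly on any DataFrame. A small sketch, rebuilding a shortened version of the example table so that the block runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Alice", "Bob"],
                   "Age": [25, 30, 35]})

row_labels = list(df.index)       # row labels (default: 0, 1, 2)
column_labels = list(df.columns)  # column labels
values = df.to_numpy()            # the underlying data as a NumPy array

print(row_labels)
print(column_labels)
print(values.shape)
print(df.shape)
```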
Key Features of DataFrame data structure:
Indexing:
DataFrame provides flexible indexing options, allowing access to rows, columns, or individual elements based on labels or integer positions.
Python3
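print(df['Name'])        # select a column (returns a Series)
print(df.loc[0])         # select a row by label
print(df.iloc[0])        # select a row by integer position
print(df.at[0, 'Name'])  # select a single value by row and column label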
Output:
0 John
1 Alice
2 Bob
Name: Name, dtype: object
Name John
Age 25
City New York
Name: 0, dtype: object
Name John
Age 25
City New York
Name: 0, dtype: object
John
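In the example above, `loc[0]` and `iloc[0]` return the same row only because the default index happens to match the integer positions. The sketch below uses hypothetical string row labels ("r1" to "r3") to make the difference between label-based and position-based access visible.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Alice", "Bob"],
                   "Age": [25, 30, 35]},
                  index=["r1", "r2", "r3"])

by_label = df.loc["r2", "Age"]  # row label and column label
by_position = df.iloc[1, 1]     # second row, second column

# Lists of labels select several rows and columns at once.
subset = df.loc[["r1", "r3"], ["Name"]]

print(by_label, by_position)
print(subset)
```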
Column Operations:
Columns in a DataFrame are Series objects, enabling various operations such as arithmetic operations, filtering, and sorting.
Python3
# Add a new column
df['Salary'] = [50000, 60000, 70000]
# Filter rows with a boolean condition on a column
high_salary_employees = df[df['Salary'] > 60000]
print(high_salary_employees)
# Sort rows by a column, highest value first
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
Output:
Name Age City Salary
2 Bob 35 Chicago 70000
Name Age City Salary
2 Bob 35 Chicago 70000
1 Alice 30 Los Angeles 60000
0 John 25 New York 50000
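The example above covers filtering and sorting. Arithmetic on columns works the same way as Series arithmetic, since each column is a Series. A brief sketch, where the `Bonus` and `Total` columns and the 10% rate are assumptions made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Alice", "Bob"],
                   "Salary": [50000, 60000, 70000]})

# Multiply a whole column by a scalar to derive a new column.
df["Bonus"] = df["Salary"] * 0.10

# Add two columns element by element.
df["Total"] = df["Salary"] + df["Bonus"]

print(df)
```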
Missing Data Handling:
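DataFrames provide methods for handling missing or NaN values, including dropping rows that contain them (dropna) or filling them with a replacement value (fillna). Both methods return a new DataFrame and leave the original unchanged. The example DataFrame contains no missing values, so both calls return it as is.
Python3
# dropna() returns a new DataFrame without rows that contain NaN
print(df.dropna())
# fillna(0) returns a new DataFrame with NaN replaced by 0
print(df.fillna(0))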
Output:
Name Age City Salary
0 John 25 New York 50000
1 Alice 30 Los Angeles 60000
2 Bob 35 Chicago 70000
Name Age City Salary
0 John 25 New York 50000
1 Alice 30 Los Angeles 60000
2 Bob 35 Chicago 70000
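Because the DataFrame used so far has no missing entries, the effect of these methods is easier to see on data that does contain gaps. The following sketch uses a made-up table in which one row is missing its age and city; the replacement values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Alice", "Bob"],
                   "Age": [25, np.nan, 35],
                   "City": ["New York", None, "Chicago"]})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Keep only rows with no missing values.
rows_without_missing = df.dropna()

# Fill each column with its own replacement value.
filled = df.fillna({"Age": df["Age"].mean(), "City": "Unknown"})

print(missing_per_column)
print(rows_without_missing)
print(filled)
```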
Grouping and Aggregation:
DataFrames support group-by operations for summarizing data and applying aggregation functions.
Python3
# Group rows by city, then average the 'Age' column within each group
avg_age_by_city = df.groupby('City')['Age'].mean()
print(avg_age_by_city)
Output:
City
Chicago 35.0
Los Angeles 30.0
New York 25.0
Name: Age, dtype: float64
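In the example above each city appears only once, so every group mean equals a single value. The sketch below uses made-up data with repeated cities and applies several aggregation functions at once through `agg`.

```python
import pandas as pd

df = pd.DataFrame({"City": ["New York", "Chicago", "New York", "Chicago"],
                   "Age": [25, 35, 45, 30]})

# One row per city, one column per aggregation function.
summary = df.groupby("City")["Age"].agg(["mean", "min", "max", "count"])
print(summary)
```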
DataFrame vs Series
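Series | DataFrame
One-dimensional. | Two-dimensional.
Holds a single dtype (mixed values are stored as object dtype). | Can be heterogeneous: each column can have its own dtype.
Values are mutable; the length is generally treated as fixed. | Values are mutable; the size can change as columns are inserted or deleted.
Element-wise computations. | Element-wise, column-wise, and row-wise computations.
Smaller API, focused on a single labeled array. | Larger API, including operations across columns such as merging and pivoting.
Alignment on index labels is supported. | Alignment on both index and column labels is supported.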
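The relationship between the two structures can be checked directly: selecting one column with single brackets returns a Series, while selecting with a list of column names returns a DataFrame. A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Alice", "Bob"],
                   "Age": [25, 30, 35]})

age_series = df["Age"]   # single brackets: a one-dimensional Series
age_frame = df[["Age"]]  # list of columns: a two-dimensional DataFrame

# A Series can be converted back into a one-column DataFrame.
back_to_frame = age_series.to_frame()

print(type(age_series).__name__, age_series.ndim)
print(type(age_frame).__name__, age_frame.ndim)
print(back_to_frame.equals(age_frame))
```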
Conclusion
In conclusion, Pandas offers two core data structures, Series and DataFrame, each suited to different data manipulation tasks. A Series handles one-dimensional labeled data with efficient indexing and vectorized operations, while a DataFrame organizes tabular data and adds versatile indexing, column operations, grouping, and methods for handling missing data. Understanding their differences is crucial for effective data analysis in Python.