In today's data-driven environment, a great deal of time goes into editing, cleaning, and analyzing data. Pandas is a widely used Python library for data manipulation. It stores data in structures called DataFrames and lets you alter, clean, or analyze that data through operations such as plotting a bar graph, adding a new row or column, or filling in missing values. These tasks often take considerable time that could be spent elsewhere. PandasAI is an extension to the pandas library that aims to make data analysis and manipulation more efficient.
What is PandasAI?
PandasAI is an extension to the pandas library that uses OpenAI's generative AI models. It lets you generate insights from a DataFrame using just a text prompt: the model turns the prompt into pandas code, which is then run against your data. Data scientists and data analysts spend a lot of time preparing data before they can analyze it, and PandasAI aims to reduce that preparation time so they can move on to the analysis itself. PandasAI should not be used in place of pandas; rather, it should be used alongside it. You can pose questions about a dataset to PandasAI and it will return the answers, often as pandas DataFrames, saving you from having to browse the dataset and write the queries by hand. Using the OpenAI API, PandasAI aims to let you converse with the machine and get the desired result instead of programming the task yourself. Behind the scenes, the result is produced by machine-generated code that operates on the DataFrame.
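To make the idea concrete, here is a minimal sketch of the prompt-to-code-to-result loop, assuming a stand-in function in place of a real LLM call. The names `stub_generate_code` and `run_prompt` are hypothetical and are not part of PandasAI; the sketch only illustrates the general pattern and does not claim to show how the library is implemented.

```python
import pandas as pd


def stub_generate_code(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: returns pandas code as text."""
    # A real model would write code based on the prompt; this stub ignores it.
    return "result = df['annual tax collected'].sum()"


def run_prompt(df: pd.DataFrame, prompt: str):
    """Run the generated code against df and return its `result` variable."""
    code = stub_generate_code(prompt)
    namespace = {"df": df}
    exec(code, namespace)  # acceptable here only because the code is a fixed string
    return namespace["result"]


# Tiny illustrative frame (values are made up for this sketch).
sample = pd.DataFrame({"country": ["Delhi", "Mumbai"],
                       "annual tax collected": [100, 250]})
total = run_prompt(sample, "Calculate the total tax collected")
print(total)
```

Because generated code is executed against your data, it is worth reading what the model produced before trusting the answer.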
How to use PandasAI?
Step 1: Install the pandasai and openai libraries
!pip install -q pandasai openai
Step 2: Import the necessary libraries
Python3
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
Step 3: Load the Dataset into a dataframe using a dictionary
Python3
dataframe = {
"country" : [
"Delhi" ,
"Mumbai" ,
"Kolkata" ,
"Chennai" ,
"Jaipur" ,
"Lucknow" ,
"Pune" ,
"Bengaluru" ,
"Amritsar" ,
"Agra" ,
],
"annual tax collected" : [
19294482072 ,
28916155672 ,
24112550372 ,
34358173362 ,
17454337886 ,
11812051350 ,
16074023894 ,
14909678554 ,
43807565410 ,
146318441864 ,
],
"happiness_index" : [ 9.94 , 7.16 , 6.35 , 8.07 , 6.98 , 6.1 , 4.23 , 8.22 , 6.87 , 3.36 ],
}
df = pd.DataFrame(dataframe)
df.head()
Output:
Note: You can also read data from a CSV file using the command pd.read_csv("file_location").
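As a small illustration of the CSV route, the sketch below reads CSV text with pd.read_csv. To stay self-contained it uses an in-memory string via io.StringIO instead of a file on disk; the variable names `csv_text` and `csv_df` are illustrative, and the three rows are copied from the dictionary above.

```python
import io

import pandas as pd

# In-memory text standing in for a CSV file on disk.
csv_text = """country,annual tax collected,happiness_index
Delhi,19294482072,9.94
Mumbai,28916155672,7.16
Kolkata,24112550372,6.35
"""

# read_csv accepts a path or any file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df.head())
print(csv_df.dtypes)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.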
Step 4: Initialize an Open AI Large-Language Model (LLM)
Since PandasAI works with an OpenAI LLM, we need to supply an OpenAI API key when creating the LLM object, as in the following code:
Python3
llm = OpenAI(api_token = 'YOUR_API_KEY' )
pandas_ai = PandasAI(llm, verbose = True , conversational = False )
If you do not have an OpenAI API key, you can create an account on the OpenAI platform and generate a new API key there. Now we are all set to use the generative model to generate insights or clean data with PandasAI.
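Rather than hard-coding the key in a notebook, one common approach is to read it from an environment variable. The sketch below assumes the variable is called OPENAI_API_KEY (a conventional name, but still an assumption here) and uses a hypothetical helper, `get_api_key`, that is not part of PandasAI.

```python
import os


def get_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or raise."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Demonstration only: a placeholder value set in-process.
# Normally the variable is set in the shell or notebook environment.
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
api_key = get_api_key()
# api_key can then be passed on, e.g. OpenAI(api_token=api_key)
```

This keeps the secret out of the notebook itself and fails early with a clear message if the variable is missing.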
Step 5: Provide a text prompt and dataframe to PandasAI
Python3
PROMPT = "YOUR_TEXT_PROMPT"
response = pandas_ai(df, PROMPT)
print (response)
Automate Pandas operations with pandasai
Now let's try some prompts on our custom dataset.
Prompt 1: Performing sum operation
Python3
response = pandas_ai(df, "Calculate the total tax collected in north Indian cities" )
print (response)
Output:
Running PandasAI with openai LLM...
Code generated:
```
import pandas as pd
# Creating the dataframe
data = {'country': ['Mumbai', 'Jaipur', 'Kolkata', 'Delhi', 'Chennai', 'Lucknow', 'Hyderabad', 'Ahmedabad', 'Bangalore', 'Pune'],
'annual tax collected': [3274294604, 7159422858, 8155677164, 3688595185, 3679908367, 4567890123, 2345678901, 3456789012, 5678901234, 6789012345],
'happiness_index': [6.98, 8.07, 8.07, 6.35, 7.16, 7.89, 7.45, 7.12, 8.56, 7.23]}
df = pd.DataFrame(data)
# Filtering north Indian cities
north_cities = ['Jaipur', 'Kolkata', 'Delhi', 'Lucknow', 'Ahmedabad']
north_df = df[df['country'].isin(north_cities)]
# Calculating total tax collected in north Indian cities
total_tax_collected = north_df['annual tax collected'].sum()
print(total_tax_collected)
```
Code running:
```
data = {'country': ['Mumbai', 'Jaipur', 'Kolkata', 'Delhi', 'Chennai',
'Lucknow', 'Hyderabad', 'Ahmedabad', 'Bangalore', 'Pune'],
'annual tax collected': [3274294604, 7159422858, 8155677164, 3688595185,
3679908367, 4567890123, 2345678901, 3456789012, 5678901234, 6789012345],
'happiness_index': [6.98, 8.07, 8.07, 6.35, 7.16, 7.89, 7.45, 7.12,
8.56, 7.23]}
north_cities = ['Jaipur', 'Kolkata', 'Delhi', 'Lucknow', 'Ahmedabad']
north_df = df[df['country'].isin(north_cities)]
total_tax_collected = north_df['annual tax collected'].sum()
print(total_tax_collected)
```
Answer: 72673421680
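The figure depends entirely on which cities the model decided to treat as north Indian. A short cross-check in plain pandas, reusing the city list from the generated code, is one way to see what was actually summed. This is a sketch for verification only; the names `model_cities` and `matched` are illustrative.

```python
import pandas as pd

# Same data as in Step 3 (the column is named "country" but holds cities).
df = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata", "Chennai", "Jaipur",
                "Lucknow", "Pune", "Bengaluru", "Amritsar", "Agra"],
    "annual tax collected": [19294482072, 28916155672, 24112550372,
                             34358173362, 17454337886, 11812051350,
                             16074023894, 14909678554, 43807565410,
                             146318441864],
    "happiness_index": [9.94, 7.16, 6.35, 8.07, 6.98, 6.1, 4.23, 8.22,
                        6.87, 3.36],
})

# City list copied from the generated code; Ahmedabad is not in df.
model_cities = ["Jaipur", "Kolkata", "Delhi", "Lucknow", "Ahmedabad"]
matched = df[df["country"].isin(model_cities)]
total = int(matched["annual tax collected"].sum())

print(matched["country"].tolist())  # rows that were actually summed
print(total)
```

Only four of the five listed cities exist in the dataframe, and Amritsar and Agra are absent from the model's list, so the total reflects the model's own selection of cities.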
Prompt 2: Analyzing the dataset
Python3
response = pandas_ai.run(df, prompt = 'Which are the 5 happiest cities?' )
print (response)
Output:
Running PandasAI with openai LLM...
Code generated:
```
import pandas as pd
# Creating the dataframe
data = {'country': ['Kolkata', 'Jaipur', 'Delhi', 'Mumbai', 'Chennai', 'Bangalore', 'Hyderabad', 'Pune', 'Ahmedabad', 'Surat'],
'annual tax collected': [3560469532, 9597107067, 4821092001, 9053727452, 3738210455, 6489321000, 5183920000, 2874610000, 3958200000, 3129400000],
'happiness_index': [6.35, 8.07, 7.16, 6.98, 6.98, 7.89, 7.45, 7.12, 6.78, 6.55]}
df = pd.DataFrame(data)
# Sorting the dataframe by happiness index in descending order
df = df.sort_values(by='happiness_index', ascending=False)
# Selecting the top 5 happiest cities
top_5_happiest_cities = df.head(5)['country'].tolist()
print(top_5_happiest_cities)
```
Code running:
```
data = {'country': ['Kolkata', 'Jaipur', 'Delhi', 'Mumbai', 'Chennai',
'Bangalore', 'Hyderabad', 'Pune', 'Ahmedabad', 'Surat'],
'annual tax collected': [3560469532, 9597107067, 4821092001, 9053727452,
3738210455, 6489321000, 5183920000, 2874610000, 3958200000, 3129400000],
'happiness_index': [6.35, 8.07, 7.16, 6.98, 6.98, 7.89, 7.45, 7.12,
6.78, 6.55]}
top_5_happiest_cities = df.head(5)['country'].tolist()
print(top_5_happiest_cities)
```
Answer: ['Delhi', 'Mumbai', 'Kolkata', 'Chennai', 'Jaipur']
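In the log above, the "Code running" block no longer contains the sort_values step, so the answer appears to be simply the first five rows of the dataframe. A quick check in plain pandas, using the data from Step 3, shows what ranking by happiness_index gives. `nlargest` is used here as a compact alternative to sorting and taking head(5); the name `top5` is illustrative.

```python
import pandas as pd

# Same data as in Step 3 (only the columns needed for this check).
df = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata", "Chennai", "Jaipur",
                "Lucknow", "Pune", "Bengaluru", "Amritsar", "Agra"],
    "happiness_index": [9.94, 7.16, 6.35, 8.07, 6.98, 6.1, 4.23, 8.22,
                        6.87, 3.36],
})

# Five rows with the highest happiness_index, highest first.
top5 = df.nlargest(5, "happiness_index")["country"].tolist()
print(top5)
```

On the Step 3 data this gives Delhi, Bengaluru, Chennai, Mumbai, and Jaipur, which differs from the answer in the log, so it is worth checking the executed code rather than only the generated code.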
Prompt 3: Performing sort operation
Python3
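# Prompt wording is an assumed example, chosen to match the generated code below
response = pandas_ai.run(df,
prompt = "Sort the dataframe by happiness index in ascending order"
)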
print (response)
Output:
Running PandasAI with openai LLM...
Code generated:
```
df_sorted = df.sort_values(by='happiness_index', ascending=True)
print(df_sorted)
```
Code running:
```
df_sorted = df.sort_values(by='happiness_index', ascending=True)
print(df_sorted)
```
Answer:
     country  annual tax collected  happiness_index
9       Agra          146318441864             3.36
6       Pune           16074023894             4.23
5    Lucknow           11812051350             6.10
2    Kolkata           24112550372             6.35
8   Amritsar           43807565410             6.87
4     Jaipur           17454337886             6.98
1     Mumbai           28916155672             7.16
3    Chennai           34358173362             8.07
7  Bengaluru           14909678554             8.22
0      Delhi           19294482072             9.94
Prompt 4: Plotting a histogram
Python3
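# Prompt wording is an assumed example, based on the output description below
PROMPT = "Plot a histogram representing tax collected in north Indian cities"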
response = pandas_ai.run(df, prompt = PROMPT)
print (response)
Output:
Running PandasAI with openai LLM...
Code generated:
```
north_cities = ['Delhi', 'Jaipur']
north_df = df[df['country'].isin(north_cities)]
import matplotlib.pyplot as plt
plt.bar(north_df['country'], north_df['annual tax collected'])
plt.xlabel('Indian cities')
plt.ylabel('Tax collected')
plt.title('Tax collection of north Indian cities')
plt.show()
```
Code running:
```
north_cities = ['Delhi', 'Jaipur']
north_df = df[df['country'].isin(north_cities)]
plt.bar(north_df['country'], north_df['annual tax collected'])
plt.xlabel('Indian cities')
plt.ylabel('Tax collected')
plt.title('Tax collection of north Indian cities')
plt.show()
```
Answer: [Figure: histogram representing tax collected in north Indian cities]
Prompt 5: Performing groupby operation
Python3
PROMPT = "Group the cities in the dataset according to their happiness index"
response = pandas_ai.run(df, prompt = PROMPT)
print (response)
Output:
Running PandasAI with openai LLM...
Code generated:
```
# Import pandas library
import pandas as pd
# Create the dataframe
data = {'country': ['Chennai', 'Delhi', 'Mumbai', 'Kolkata', 'Jaipur', 'Bangalore', 'Hyderabad', 'Pune', 'Ahmedabad', 'Surat'],
'annual tax collected': [4115278226, 5211175683, 9898166675, 2429903829, 2640456722, 6329812345, 4781234567, 3214567890, 5678901234, 4321987654],
'happiness_index': [6.98, 6.98, 8.07, 9.94, 7.16, 8.56, 7.89, 6.78, 8.23, 7.45]}
df = pd.DataFrame(data)
# Group the cities by their happiness index
grouped_df = df.groupby('happiness_index')
# Print the groups
for name, group in grouped_df:
print("Happiness Index:", name)
print(group)
```
Code running:
```
data = {'country': ['Chennai', 'Delhi', 'Mumbai', 'Kolkata', 'Jaipur',
'Bangalore', 'Hyderabad', 'Pune', 'Ahmedabad', 'Surat'],
'annual tax collected': [4115278226, 5211175683, 9898166675, 2429903829,
2640456722, 6329812345, 4781234567, 3214567890, 5678901234, 4321987654],
'happiness_index': [6.98, 6.98, 8.07, 9.94, 7.16, 8.56, 7.89, 6.78,
8.23, 7.45]}
grouped_df = df.groupby('happiness_index')
for name, group in grouped_df:
print('Happiness Index:', name)
print(group)
```
Answer: Happiness Index: 3.36
country annual tax collected happiness_index
9 Agra 146318441864 3.36
Happiness Index: 4.23
country annual tax collected happiness_index
6 Pune 16074023894 4.23
Happiness Index: 6.1
country annual tax collected happiness_index
5 Lucknow 11812051350 6.1
Happiness Index: 6.35
country annual tax collected happiness_index
2 Kolkata 24112550372 6.35
Happiness Index: 6.87
country annual tax collected happiness_index
8 Amritsar 43807565410 6.87
Happiness Index: 6.98
country annual tax collected happiness_index
4 Jaipur 17454337886 6.98
Happiness Index: 7.16
country annual tax collected happiness_index
1 Mumbai 28916155672 7.16
Happiness Index: 8.07
country annual tax collected happiness_index
3 Chennai 34358173362 8.07
Happiness Index: 8.22
country annual tax collected happiness_index
7 Bengaluru 14909678554 8.22
Happiness Index: 9.94
country annual tax collected happiness_index
0 Delhi 19294482072 9.94
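Because every city has a distinct happiness index, grouping on the raw value yields one group per row. If broader groups are wanted, one option is to bin the index first. The sketch below uses pd.cut with band edges (5 and 7) and labels chosen purely for illustration; they are assumptions, not part of the original example.

```python
import pandas as pd

# Same cities and happiness values as in Step 3.
df = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata", "Chennai", "Jaipur",
                "Lucknow", "Pune", "Bengaluru", "Amritsar", "Agra"],
    "happiness_index": [9.94, 7.16, 6.35, 8.07, 6.98, 6.1, 4.23, 8.22,
                        6.87, 3.36],
})

# Band edges and labels are illustrative assumptions: (0, 5], (5, 7], (7, 10].
df["band"] = pd.cut(df["happiness_index"], bins=[0, 5, 7, 10],
                    labels=["low", "medium", "high"])

# observed=True keeps only bands that actually occur in the data.
band_sizes = df.groupby("band", observed=True).size().to_dict()
cities_by_band = (df.groupby("band", observed=True)["country"]
                    .apply(list).to_dict())

print(band_sizes)
print(cities_by_band)
```

Grouping on the binned column produces three groups instead of ten, which is usually closer to what a "group by happiness" question is asking for.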
Prompt 6: Describe the dataset
Python3
PROMPT = "Give statistical information about dataset"
response = pandas_ai.run(df, prompt = PROMPT)
print (response)
Output:
Running PandasAI with openai LLM...
Code generated:
```
import pandas as pd
# create the dataframe
data = {'country': ['Delhi', 'Chennai', 'Kolkata', 'Mumbai', 'Jaipur'],
'annual tax collected': [6851245018, 5569913156, 497203726, 1780282822, 856852833],
'happiness_index': [6.98, 6.35, 6.35, 6.98, 7.16]}
df = pd.DataFrame(data)
# describe the dataframe
print(df.describe())
```
Code running:
```
data = {'country': ['Delhi', 'Chennai', 'Kolkata', 'Mumbai', 'Jaipur'],
'annual tax collected': [6851245018, 5569913156, 497203726, 1780282822,
856852833], 'happiness_index': [6.98, 6.35, 6.35, 6.98, 7.16]}
print(df.describe())
```
Answer:
       annual tax collected  happiness_index
count          1.000000e+01        10.000000
mean           3.570575e+10         6.728000
std            4.010314e+10         1.907149
min            1.181205e+10         3.360000
25%            1.641910e+10         6.162500
50%            2.170352e+10         6.925000
75%            3.299767e+10         7.842500
max            1.463184e+11         9.940000
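The sample data the model wrote into its generated code has only five rows, yet the answer reports a count of 10, so the statistics come from the actual dataframe. The same summary can be reproduced directly with pandas; the sketch below rebuilds the Step 3 data and calls describe(), which by default covers only the numeric columns. The name `stats` is illustrative.

```python
import pandas as pd

# Same data as in Step 3.
df = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata", "Chennai", "Jaipur",
                "Lucknow", "Pune", "Bengaluru", "Amritsar", "Agra"],
    "annual tax collected": [19294482072, 28916155672, 24112550372,
                             34358173362, 17454337886, 11812051350,
                             16074023894, 14909678554, 43807565410,
                             146318441864],
    "happiness_index": [9.94, 7.16, 6.35, 8.07, 6.98, 6.1, 4.23, 8.22,
                        6.87, 3.36],
})

# Summary statistics for the numeric columns.
stats = df.describe()
print(stats)

# Individual figures can be read with .loc[row_label, column_label].
mean_happiness = stats.loc["mean", "happiness_index"]
```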
Prompt 7: Check for missing values
Python3
PROMPT = "Are there any missing values in the dataset"
response = pandas_ai.run(df, prompt = PROMPT)
print (response)
Output:
Running PandasAI with openai LLM...
Code generated:
```
import pandas as pd
# Creating the dataframe
data = {'country': ['Jaipur', 'Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
'annual tax collected': [8203131465, 406012666, 6195812866, 8532100009, 2405598967],
'happiness_index': [8.07, 6.98, 6.98, 7.16, 6.35]}
df = pd.DataFrame(data)
# Checking for missing values
print(df.isnull().values.any())
```
Code running:
```
data = {'country': ['Jaipur', 'Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
'annual tax collected': [8203131465, 406012666, 6195812866, 8532100009,
2405598967], 'happiness_index': [8.07, 6.98, 6.98, 7.16, 6.35]}
print(df.isnull().values.any())
```
Answer: False
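A single True or False does not say where values are missing. As a complement, the sketch below shows the same check in plain pandas on a small frame with one value deliberately left out, along with a per-column count. The frame `small` and its gap are constructed for illustration only.

```python
import pandas as pd

small = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata"],
    "happiness_index": [9.94, None, 6.35],  # one value removed on purpose
})

# Same test as in the generated code: is anything missing at all?
has_missing = bool(small.isnull().values.any())

# Number of missing values in each column.
missing_per_column = small.isna().sum().to_dict()

print(has_missing)
print(missing_per_column)
```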
Frequently Asked Questions (FAQs)
1. What is Pandas?
Pandas is a robust Python library for handling and analyzing data. It offers data structures and operations for managing structured data effectively, including tabular and time series data.
2. How is PandasAI different from Pandas?
In pandas, you perform operations on a dataset by writing the code yourself, which can be time consuming. PandasAI makes this easier by performing operations on datasets from simple text prompts. It uses OpenAI LLMs to generate code that carries out the requested operations on the provided dataset.
3. Is Pandas AI suitable for big data analysis?
PandasAI is handy for performing many tasks from a simple text prompt, but it is not well suited to big data analysis: it still uses the DataFrame as its data structure, which has a large memory overhead and can cause RAM issues with very large datasets.
4. Where can I find resources to learn PandasAI?
You can refer to the official GitHub repository at https://github.com/gventuri/pandas-ai.
Conclusion
In this article, we looked at PandasAI's advantages as an addition for pandas library users. PandasAI can answer natural-language prompts that resemble SQL queries and produce visualizations directly from a DataFrame, and it can increase productivity by automating several routine steps. It is important to remember that PandasAI, however capable, does not replace the pandas library: pandas' own capabilities are still needed for some more involved operations, such as filling in missing data in a DataFrame, and its extensive ecosystem and wide range of features remain essential for challenging data manipulation and analysis tasks. PandasAI is therefore best seen as a complement that makes working with data in Python more convenient.