
Uber Rides Data Analysis using Python

Last Updated : 14 Dec, 2022

In this article, we will use Python and its different libraries to analyze the Uber Rides Data.

Importing Libraries

The analysis will be done using the following libraries : 

  • Pandas: This library loads the data into a DataFrame (a 2D tabular structure) and provides many functions for analysis tasks.
  • Numpy: Numpy arrays are fast and can perform large computations in a short time.
  • Matplotlib / Seaborn: These libraries are used to draw visualizations.

To import all these libraries, we can use the code below:

Python3




import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Importing Dataset

After importing the libraries, download the data using the link.

Once downloaded, you can import the dataset using the pandas library.

Python3




dataset = pd.read_csv("UberDataset.csv")
dataset.head()


Output : 

[Image: first five rows of the dataset]

 

To find the shape of the dataset, we can use dataset.shape

Python3




dataset.shape


Output : 

(1156, 7)

To understand the data better, we need the null value counts, the datatype of each column, and so on. For that, we use the code below.

Python3




dataset.info()


Output : 

[Image: output of dataset.info()]
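
The same information can also be read numerically by counting the nulls per column. The following is a minimal sketch on a hypothetical toy frame (the `toy` data is made up for illustration and is not the Uber dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "CATEGORY": ["Business", "Personal", "Business"],
    "MILES": [5.1, 3.3, 12.0],
    "PURPOSE": ["Meeting", np.nan, np.nan],
})

null_counts = toy.isna().sum()   # number of missing values per column
null_share = toy.isna().mean()   # fraction of missing values per column

print(null_counts)
print(null_share)
```

Running the same two calls on `dataset` would show which columns need attention before analysis.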

 

Data Preprocessing

The PURPOSE column contains many null values, so we fill them with the placeholder keyword NOT. You can try another placeholder too.

Python3




# Assign the result back instead of calling fillna with inplace=True on a column
dataset['PURPOSE'] = dataset['PURPOSE'].fillna("NOT")


Convert START_DATE and END_DATE to datetime format so that they can be used for further analysis.

Python3




dataset['START_DATE'] = pd.to_datetime(dataset['START_DATE'],
                                       errors='coerce')
dataset['END_DATE'] = pd.to_datetime(dataset['END_DATE'],
                                     errors='coerce')
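
The effect of errors='coerce' can be seen on a few hypothetical strings: values that cannot be parsed become NaT instead of raising an error. The sample strings and the explicit format below are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values; the last one is not a date
raw = pd.Series(["1/1/2016 21:11", "1/2/2016 1:25", "not a date"])

# Assumed format, valid for this toy sample only
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

Rows that end up as NaT are the ones that a later dropna call would remove.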


Split START_DATE into date and time (hour) columns, then bin the hour into four categories: Morning, Afternoon, Evening, and Night.

Python3




dataset['date'] = pd.DatetimeIndex(dataset['START_DATE']).date
dataset['time'] = pd.DatetimeIndex(dataset['START_DATE']).hour
 
# Bin the start hour into parts of the day.
# pd.cut uses right-closed bins by default: (0, 10], (10, 15], (15, 19], (19, 24],
# so hour 0 falls outside the first bin and becomes NaN unless
# include_lowest=True is passed.
dataset['day-night'] = pd.cut(x=dataset['time'],
                              bins=[0, 10, 15, 19, 24],
                              labels=['Morning', 'Afternoon', 'Evening', 'Night'])
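
How these bin edges map individual hours can be checked on a small, hypothetical series of hours. This sketch uses the same bins and labels as above; the `hours` values are made up. It also shows the `include_lowest` option of `pd.cut`, which is not used in the article's code.

```python
import pandas as pd

# Hypothetical start hours covering the bin edges
hours = pd.Series([0, 5, 10, 11, 15, 16, 19, 20, 23])
bins = [0, 10, 15, 19, 24]
labels = ['Morning', 'Afternoon', 'Evening', 'Night']

# Default behaviour: bins are open on the left, closed on the right
default_cut = pd.cut(hours, bins=bins, labels=labels)

# include_lowest=True makes the first bin include its left edge (hour 0)
inclusive_cut = pd.cut(hours, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"hour": hours,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```

With the default settings, hour 0 has no category, so rides that start between midnight and 1am are dropped by a later dropna call.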


Once we are done creating the new columns, we can drop rows with null values.

Python3




dataset.dropna(inplace=True)


It is also important to drop duplicate rows from the dataset. To do that, refer to the code below.

Python3




dataset.drop_duplicates(inplace=True)
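
Before dropping duplicates, it can help to count how many rows would be removed. A minimal sketch on a hypothetical toy frame (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical rides; the first two rows are identical
toy = pd.DataFrame({
    "START": ["Cary", "Cary", "Apex"],
    "STOP": ["Durham", "Durham", "Cary"],
    "MILES": [10.4, 10.4, 5.5],
})

n_dupes = toy.duplicated().sum()   # rows that repeat an earlier row
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row

print("Duplicate rows:", n_dupes)
print(deduped)
```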


Data Visualization

In this section, we will try to understand and compare all columns.

Let’s start by checking the number of unique values in the columns with object datatype.

Python3




obj = (dataset.dtypes == 'object')
object_cols = list(obj[obj].index)
 
unique_values = {}
for col in object_cols:
  unique_values[col] = dataset[col].unique().size
unique_values


Output : 

{'CATEGORY': 2, 'START': 177, 'STOP': 188, 'PURPOSE': 11, 'date': 294}
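
The same counts can likely be obtained more compactly with `nunique`. This is a sketch on a hypothetical toy frame, not the article's code; it selects non-numeric columns with `select_dtypes(exclude="number")`, which is an assumption about which columns are of interest.

```python
import pandas as pd

# Hypothetical toy frame with two text columns and one numeric column
toy = pd.DataFrame({
    "CATEGORY": ["Business", "Business", "Personal"],
    "START": ["Cary", "Apex", "Cary"],
    "MILES": [5.1, 3.3, 12.0],
})

# Count distinct values in every non-numeric column
unique_counts = toy.select_dtypes(exclude="number").nunique().to_dict()
print(unique_counts)
```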

Now, we will use the matplotlib and seaborn libraries to draw count plots of the CATEGORY and PURPOSE columns.

Python3




plt.figure(figsize=(10,5))
 
plt.subplot(1,2,1)
# Pass the column as x=; recent seaborn versions do not accept it positionally
sns.countplot(x=dataset['CATEGORY'])
plt.xticks(rotation=90)
 
plt.subplot(1,2,2)
sns.countplot(x=dataset['PURPOSE'])
plt.xticks(rotation=90)


Output : 

[Image: count plots of the CATEGORY and PURPOSE columns]

 

Let’s do the same for the time of day, using the day-night column we created above.

Python3




sns.countplot(x=dataset['day-night'])
plt.xticks(rotation=90)


Output : 

[Image: count plot of the day-night column]

 

Now, we will be comparing the two different categories along with the PURPOSE of the user.

Python3




plt.figure(figsize=(15, 5))
sns.countplot(data=dataset, x='PURPOSE', hue='CATEGORY')
plt.xticks(rotation=90)
plt.show()


Output : 

[Image: count plot of PURPOSE split by CATEGORY]
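
The numbers behind such a grouped count plot can be tabulated with `pd.crosstab`. A minimal sketch on hypothetical rows (the values are made up and only mimic the column names used above):

```python
import pandas as pd

# Hypothetical rides
toy = pd.DataFrame({
    "CATEGORY": ["Business", "Business", "Business", "Personal"],
    "PURPOSE": ["Meeting", "Meeting", "Meal/Entertain", "Moving"],
})

# Rows are purposes, columns are categories, cells are ride counts
table = pd.crosstab(toy["PURPOSE"], toy["CATEGORY"])
print(table)
```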

 

Insights from the above count-plots : 

  • Most of the rides are booked for business purposes.
  • Most people book cabs for Meeting and Meal / Entertain purposes.
  • Most cabs are booked in the Afternoon category (start hours 11 to 15 under the bins used above, i.e. roughly 11am-4pm).

As we have seen, CATEGORY and PURPOSE are two very important columns, so we will now use OneHotEncoder to encode them as numeric columns.

Python3




from sklearn.preprocessing import OneHotEncoder
object_cols = ['CATEGORY', 'PURPOSE']
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
OH_encoder = OneHotEncoder(sparse_output=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(dataset[object_cols]))
OH_cols.index = dataset.index
# get_feature_names_out replaces the removed get_feature_names
OH_cols.columns = OH_encoder.get_feature_names_out()
df_final = dataset.drop(object_cols, axis=1)
dataset = pd.concat([df_final, OH_cols], axis=1)
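
For a quick look at what one-hot encoding produces, pandas offers `get_dummies`. Note that this is a different technique from the scikit-learn OneHotEncoder used above, shown here only as a sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "CATEGORY": ["Business", "Personal", "Business"],
    "PURPOSE": ["Meeting", "NOT", "Customer Visit"],
    "MILES": [5.1, 3.3, 12.0],
})

# Each distinct value becomes its own 0/1 column named <column>_<value>
encoded = pd.get_dummies(toy, columns=["CATEGORY", "PURPOSE"], dtype=float)
print(encoded)
```

Each row has exactly one 1 among the CATEGORY columns and one among the PURPOSE columns.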


After that, we can find the correlation between the columns using a heatmap.

Python3




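# Keep only numeric columns; corr() is not defined for text or datetime columns
numeric_dataset = dataset.select_dtypes(include=['number'])
plt.figure(figsize=(12, 6))
sns.heatmap(numeric_dataset.corr(),
            cmap='BrBG',
            fmt='.2f',
            linewidths=2,
            annot=True)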


Output : 

[Image: correlation heatmap]

 

Insights from the heatmap:

  • The Business and Personal categories are highly negatively correlated, as already seen earlier, so this plot supports the above conclusions.
  • There is not much correlation between the other features.
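
The strong negative correlation between the two category columns follows from the encoding itself: when a column has only two values, one indicator is always 1 minus the other. A small sketch with made-up indicator values:

```python
import pandas as pd

# Hypothetical one-hot indicator for a two-valued column
is_business = pd.Series([1, 0, 1, 1, 0], dtype=float)
is_personal = 1 - is_business   # the complementary indicator

# Pearson correlation of a variable with its complement
corr = is_business.corr(is_personal)
print(corr)
```

The result is -1 up to floating-point rounding, so this cell of the heatmap reflects the encoding rather than a property of the rides.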

Now, we need to visualize the month data. This can be done the same way as before (for hours).

Python3




dataset['MONTH'] = pd.DatetimeIndex(dataset['START_DATE']).month
month_label = {1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'April',
               5.0: 'May', 6.0: 'June', 7.0: 'July', 8.0: 'Aug',
               9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'}
dataset["MONTH"] = dataset.MONTH.map(month_label)
 
# Number of rides per month and longest ride (in miles) per month
mon = dataset.MONTH.value_counts(sort=False)
max_miles = dataset.groupby('MONTH', sort=False)['MILES'].max()
 
# "MONTHS" holds the ride count per month, "VALUE COUNT" the maximum MILES;
# reindex keeps both series in the same month order
df = pd.DataFrame({"MONTHS": mon.reindex(max_miles.index).values,
                   "VALUE COUNT": max_miles})
 
p = sns.lineplot(data=df)
p.set(xlabel="MONTHS", ylabel="VALUE COUNT")


Output :

[Image: line plot of the monthly values]
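
The two monthly quantities can also be computed in one step with a named aggregation. This is a sketch on hypothetical rides, not the article's code; the column names `rides` and `max_miles` are made up for illustration.

```python
import calendar
import pandas as pd

# Hypothetical rides
toy = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-01-05 09:00", "2016-01-20 18:30",
                                  "2016-03-02 12:15", "2016-12-24 08:00"]),
    "MILES": [5.0, 12.5, 3.2, 40.0],
})

# Group by calendar month number so that the result stays in calendar order
monthly = toy.groupby(toy["START_DATE"].dt.month)["MILES"].agg(
    rides="count", max_miles="max")

# Replace month numbers with short names for display
monthly.index = monthly.index.map(lambda m: calendar.month_abbr[m])
print(monthly)
```

Grouping by the month number first avoids the alphabetical ordering that grouping by month name would produce.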

 

Insights from the above plot : 

  • The counts are very irregular.
  • Still, it is clear that the counts are much lower during Nov, Dec, and Jan, which is consistent with the winter season in Florida, US.

Visualization for days data.

Python3




dataset['DAY'] = dataset.START_DATE.dt.weekday
day_label = {
    0: 'Mon', 1: 'Tues', 2: 'Wed', 3: 'Thurs', 4: 'Fri', 5: 'Sat', 6: 'Sun'
}
dataset['DAY'] = dataset['DAY'].map(day_label)


Python3




# Use a separate name so the day_label mapping above is not overwritten
day_counts = dataset.DAY.value_counts()
sns.barplot(x=day_counts.index, y=day_counts.values)
plt.xlabel('DAY')
plt.ylabel('COUNT')


Output :

[Image: bar plot of ride counts per day of the week]
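
Since value_counts orders days by frequency, the bars do not follow the calendar week. One possible way to get counts in weekday order is sketched below on hypothetical dates, using `dt.day_name()` and `reindex`:

```python
import pandas as pd

# Hypothetical ride start dates (4 Jan 2016 was a Monday)
rides = pd.Series(pd.to_datetime(
    ["2016-01-04", "2016-01-05", "2016-01-08", "2016-01-11"]))

order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# Count rides per weekday, then arrange in calendar order;
# days without rides get a count of 0
day_counts = rides.dt.day_name().value_counts().reindex(order, fill_value=0)
print(day_counts)
```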

 

Now, let’s explore the MILES column.

We can use boxplot to check the distribution of the column.

Python3




sns.boxplot(x=dataset['MILES'])


Output :

[Image: box plot of the MILES column]
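
A box plot becomes hard to read when a few very long rides stretch the axis. The usual 1.5 × IQR rule that box plots use to mark outliers can be computed by hand; the sketch below uses made-up mileage values, not the dataset.

```python
import pandas as pd

# Hypothetical ride distances with one extreme value
miles = pd.Series([0.5, 2.0, 4.5, 4.8, 7.1, 12.0, 18.5, 25.0, 310.3])

q1 = miles.quantile(0.25)
q3 = miles.quantile(0.75)
iqr = q3 - q1

# Points above this fence are drawn as individual outliers
upper_fence = q3 + 1.5 * iqr
outliers = miles[miles > upper_fence]

print("Q1:", q1, "Q3:", q3, "upper fence:", upper_fence)
print(outliers)
```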

 

The graph is hard to read at this scale. Let’s zoom in on values less than 100.

Python3




sns.boxplot(x=dataset[dataset['MILES']<100]['MILES'])


Output :

[Image: box plot of MILES for values below 100]

 

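It’s a bit more visible now. For more clarity, we can plot the distribution of values less than 40 using histplot with a KDE curve (distplot is deprecated in recent seaborn versions).

Python3




sns.histplot(x=dataset[dataset['MILES']<40]['MILES'], kde=True)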


Output :

[Image: distribution plot of MILES for values below 40]

 

Insights from the above plots :

  • Most cabs are booked for distances of 4-5 miles.
  • The majority of people choose cabs for distances of 0-20 miles.
  • For distances of more than 20 miles, the cab count is nearly negligible.
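
Statements like these can be checked numerically by bucketing the distances. The following is a rough sketch on made-up mileage values; the bucket edges are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ride distances in miles
miles = pd.Series([0.5, 2.0, 4.5, 4.8, 7.1, 12.0, 18.5, 25.0, 310.3])

# Assumed bucket edges; each bucket is open on the left, closed on the right
buckets = pd.cut(miles, bins=[0, 5, 10, 20, 40, np.inf])
bucket_counts = buckets.value_counts().sort_index()

# Fraction of rides shorter than 20 miles
share_under_20 = (miles < 20).mean()

print(bucket_counts)
print("Share under 20 miles:", round(share_under_20, 2))
```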

