Tools to Automate EDA

Exploratory Data Analysis (EDA) is a critical phase in the data analysis process, where analysts and data scientists examine and explore the characteristics and patterns within a dataset. In this article, we'll learn how to automate this process with Python libraries.

Exploratory Data Analysis

In EDA, we use visualization methods and statistical tools to analyze our datasets and discover patterns, relationships, anomalies, and key insights that can help in analysis or decision-making. It is a comprehensive approach to understanding and summarizing the main characteristics of a dataset. There are three types of analysis: univariate, bivariate, and multivariate. Let's look at each of them in turn.

Univariate analysis

It involves examining a single variable in isolation, for example:



  1. Descriptive Statistics: We examine the mean, median, mode, and standard deviation of the numerical columns of our data.
  2. Data Visualization: We create visual representations of individual columns, such as histograms, box plots, bar charts, and PDF or PMF plots (depending on the type of data), along with summary data.
  3. Data Distribution Analysis: We investigate the distribution of each variable to understand its shape, skewness, and potential impact on the analysis.
  4. Identify Gaps in Data: We identify issues with the data, such as missing values, outliers, or inconsistencies.
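
As a rough illustration of the univariate steps above, the following sketch uses plain pandas. The small DataFrame is a hypothetical sample with Titanic-like column names, made up for this example; it is not the real dataset.

```python
import pandas as pd

# Hypothetical sample with Titanic-like columns (illustration only).
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.46, 51.86],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

# Descriptive statistics (count, mean, std, quartiles) for numerical columns.
numeric_summary = sample.describe()

# Shape of each numerical distribution, summarized by its skewness.
skewness = sample.skew(numeric_only=True)

# Gaps in the data: number of missing values per column.
missing_counts = sample.isna().sum()

# Frequency table for a categorical column (the numbers behind a bar chart).
sex_counts = sample["Sex"].value_counts()

print(numeric_summary)
print(skewness)
print(missing_counts)
print(sex_counts)
```

Writing this for every column, and adding the plots, is the repetitive work that the libraries discussed later try to take over.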

Bivariate analysis

It involves the examination of the relationship between two variables:

  1. Scatter Plots: Creating scatter plots to visualize the relationship between two numerical variables, i.e., how one varies with the other.
  2. Correlation Analysis: Calculating correlation coefficients (e.g., Pearson correlation) to quantify the strength and direction of the linear relationship between two numerical variables.
  3. Categorical Plots: Using visualizations like stacked bar charts or grouped bar charts to compare the distribution of a categorical variable across different levels of another variable.
  4. Contingency Tables: Constructing contingency tables to show the frequency distribution of two categorical variables.
  5. Heatmaps: Creating heatmaps to visualize the strength of the relationship between pairs of numerical variables, for example as a color-coded correlation matrix.
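
The numerical parts of a bivariate analysis can be sketched with pandas alone. The DataFrame below is a hypothetical sample with Titanic-like column names, used only to show the calls; the values are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical sample with Titanic-like columns (illustration only).
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Pearson correlation between two numerical variables (pandas' default method).
pearson_r = sample["Age"].corr(sample["Fare"])

# Pairwise correlation matrix: the table a correlation heatmap would color.
corr_matrix = sample[["Age", "Fare", "Survived"]].corr()

# Contingency table: frequency of each combination of two categorical variables.
contingency = pd.crosstab(sample["Sex"], sample["Survived"])

print(pearson_r)
print(corr_matrix)
print(contingency)
```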

Multivariate Analysis

Beyond univariate and bivariate analysis, EDA may also incorporate multivariate techniques when exploring relationships involving more than two variables, such as:

  1. Dimensionality reduction techniques like Principal Component Analysis (PCA).
  2. Clustering methods to group observations based on patterns in multiple variables.
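
To make the idea of dimensionality reduction concrete, here is a minimal PCA sketch using only NumPy (via a singular value decomposition of the standardized data). The data is synthetic and generated just for this example; in practice a library implementation such as scikit-learn's PCA would typically be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustration only): 100 observations of 3 variables,
# where the third variable is almost a copy of the first.
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + 0.1 * rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# Standardize each column so that all variables are on the same scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the centered data: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Share of the total variance captured by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project the observations onto the first two principal components.
X_2d = X_std @ Vt[:2].T

print(explained_variance_ratio)
print(X_2d.shape)
```

Because the third variable carries almost no information beyond the first, two components are enough to describe most of the variation in this synthetic example.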

If we perform EDA manually, we generally need to write a lot of code for each type of analysis. Automated Exploratory Data Analysis (Auto EDA) refers to the use of prebuilt Python libraries to perform the initial stages of EDA. This greatly reduces manual effort: with only a few lines of code, we can generate a detailed analysis and a concise summary of the main characteristics of a dataset.

Advantages of Automating EDA

  1. By automating repetitive tasks, automated EDA tools save a significant amount of time compared to manually creating visualizations and summary statistics, so analysts and data scientists can devote more of their time to gaining insights from the data.
  2. Automated EDA produces well-formatted reports that can be easily shared with team members, stakeholders, or collaborators. This enhances communication and collaboration.
  3. Some automated EDA tools offer interactive interfaces that allow users to explore data dynamically. This can be particularly useful for gaining a deeper understanding of the data.
  4. Automated EDA can help analysts discover patterns, trends, and outliers in the data more effectively.

Python Libraries for Exploratory Data Analysis

There are many such libraries available. We will explore the most popular of them:

  1. ydata-profiling (previously known as pandas-profiling)
  2. AutoViz
  3. SweetViz
  4. DataPrep
  5. D-Tale

Before exploring these libraries, let us look at the dataset that we will use as an example.

Throughout this article, we will be using the Titanic dataset. The Titanic dataset is a well-known dataset in the field of data science and machine learning. It contains information about passengers on the Titanic, including whether they survived or not.




import pandas as pd
df = pd.read_csv("/content/train.csv")

1. Ydata-Profiling

ydata-profiling generates a profile report from a pandas DataFrame in a few lines of code. The report can be displayed inside a notebook or exported to an HTML file.

Install the library

!pip install ydata-profiling

Implementation




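import ydata_profiling as yp

# Build the profile report for the DataFrame loaded earlier
profile = yp.ProfileReport(df)
# Display the report inline in the notebook
profile.to_notebook_iframe()
# Save the report as a standalone HTML file
profile.to_file('eda_report.html')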

Output:

Output of data profiling
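
On larger datasets a full report can take a while to build. The sketch below shows two common ways to reduce the work, assuming the `title` and `minimal` arguments of `ProfileReport`: profiling a random sample, and switching on minimal mode. The helper names `sample_for_profiling` and `build_minimal_profile`, the row limit, and the report title are hypothetical choices made for this example.

```python
import pandas as pd


def sample_for_profiling(df, max_rows=10_000, seed=0):
    """Return df unchanged if it is small, otherwise a random sample of max_rows rows."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def build_minimal_profile(df, title="Titanic EDA (minimal)"):
    """Build a report in minimal mode, which skips the more expensive computations."""
    # Imported here so the sampling helper works even without the library installed.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title, minimal=True)


# Usage (requires ydata-profiling to be installed):
# profile = build_minimal_profile(sample_for_profiling(df))
# profile.to_file("eda_report_minimal.html")
```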

2. AutoViz

AutoViz provides a rapid visual snapshot of the data. It is built on top of Matplotlib and Seaborn, and it can quickly generate a variety of charts and graphs.

Install the library

!pip install autoviz

Implementation

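We use the AutoViz library to generate the graphs as shown below.

The ‘AV.AutoViz()’ method offers several customizable arguments for streamlined data visualization, such as the target variable, the chart format, and the maximum number of rows and columns to analyze: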




from autoviz import AutoViz_Class
%matplotlib inline
AV = AutoViz_Class()
  
filename = "/content/train.csv"
target_variable = "Survived"
  
dft = AV.AutoViz(
    filename,
    sep=",",
    depVar=target_variable,
    dfte=None,
    header=0,
    verbose=1,
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
    save_plot_dir="/content/"
)

Output:

Output of Autoviz library
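
If the data is already loaded into a DataFrame, it does not have to be read from disk again. A minimal sketch, assuming that `AutoViz()` accepts an empty `filename` together with the `dfte` argument for an in-memory DataFrame; the wrapper name `visualize_frame` is hypothetical.

```python
def visualize_frame(df, target="Survived", chart_format="svg"):
    """Run AutoViz on an in-memory DataFrame instead of a CSV file."""
    # Imported here so that defining the helper does not require AutoViz.
    from autoviz import AutoViz_Class

    av = AutoViz_Class()
    # An empty filename plus dfte=df tells AutoViz to use the DataFrame directly.
    return av.AutoViz(
        filename="",
        dfte=df,
        depVar=target,
        verbose=1,
        chart_format=chart_format,
    )


# Usage (requires autoviz to be installed):
# dft = visualize_frame(df)
```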

3. SweetViz

The library is mainly known for visualizing target values and comparing datasets. It is a good tool for comparing different datasets, such as train and test sets, or different parts of the same dataset (for example, a dataset divided into two groups based on a categorical feature like gender).

Install the library

!pip install sweetviz

Implementation

Let us use this library to compare two subsets of our DataFrame (male vs. female).




import sweetviz as sv
  
feature_config = sv.FeatureConfig(skip="PassengerId", force_text=["Age"])
my_report = sv.compare_intra(df, df["Sex"] == "male", ["Male", "Female"], "Survived", feature_config)
my_report.show_notebook()
my_report.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

Output:

SweetViz Output
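
SweetViz can also compare two separate DataFrames, such as a train and a test set. The sketch below assumes the `sv.compare()` call with `[dataframe, name]` pairs. Since only train.csv is loaded in this article, a simple random split stands in for a real train/test pair; the helper names `split_frame` and `compare_train_test` are hypothetical.

```python
import pandas as pd


def split_frame(df, train_frac=0.8, seed=0):
    """Randomly split a DataFrame (with a unique index) into two disjoint parts."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test


def compare_train_test(train, test, target="Survived"):
    """Build a SweetViz report that compares two DataFrames side by side."""
    # Imported here so the split helper works even without SweetViz installed.
    import sweetviz as sv

    return sv.compare([train, "Train"], [test, "Test"], target)


# Usage (requires sweetviz to be installed):
# train_part, test_part = split_frame(df)
# report = compare_train_test(train_part, test_part)
# report.show_html("train_vs_test.html")
```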

4. DataPrep

The notable feature that makes DataPrep stand out is that its visualizations are interactive and come with insight notes, unlike the static visualizations of a profiling report.

NOTE: Restart the runtime session before executing the code.

Install the library

!pip install dataprep

Implementation




from dataprep.eda import create_report
create_report(df)

Output:

DataPrep Output
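
Besides the full report, `dataprep.eda` also provides individual plotting functions for a more targeted look at the data. A minimal sketch, assuming the `plot`, `plot_correlation`, and `plot_missing` functions; the wrapper name `explore_with_dataprep` and the default column are hypothetical choices for this example.

```python
def explore_with_dataprep(df, column="Age"):
    """Collect a few targeted DataPrep views of a DataFrame."""
    # Imported here so that defining the helper does not require DataPrep.
    from dataprep.eda import plot, plot_correlation, plot_missing

    return {
        "overview": plot(df),                 # distribution of every column
        "column": plot(df, column),           # detailed view of a single column
        "correlation": plot_correlation(df),  # correlations between numerical columns
        "missing": plot_missing(df),          # where values are missing
    }


# Usage in a notebook (requires dataprep to be installed):
# views = explore_with_dataprep(df)
# views["missing"]  # displaying an entry renders the interactive plot
```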

5. D-Tale

The key feature of D-Tale is that users can interact with its interface to explore the dataset dynamically. The interface typically includes features like filtering, sorting, and visualizations, providing a user-friendly environment for data exploration.

NOTE: The D-Tale link does not work in Colab.

Run the code below on a local machine, and ensure that df is loaded with train.csv before running it.

Install the library

!pip install dtale

Implementation




import dtale
dtale.show(df)

Output:

D-Tale
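
When running locally, it can help to keep a handle on the D-Tale instance so that the page can be opened and the server shut down explicitly. A minimal sketch, assuming the `open_browser()` and `kill()` methods of the object returned by `dtale.show()`; the wrapper name `open_dtale` is hypothetical.

```python
def open_dtale(df, open_browser=True):
    """Start a D-Tale session for df and optionally open it in the default browser."""
    # Imported here so that defining the helper does not require D-Tale.
    import dtale

    instance = dtale.show(df)
    if open_browser:
        instance.open_browser()
    return instance


# Usage on a local machine (requires dtale to be installed):
# session = open_dtale(df)
# ... explore the data in the browser ...
# session.kill()  # shut the D-Tale server down when finished
```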

Comparing Data Exploration Libraries

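Each of the libraries discussed above brings something unique. ydata-profiling produces a full profile report that can be viewed in a notebook or saved as HTML. AutoViz gives a rapid visual snapshot built on Matplotlib and Seaborn. SweetViz focuses on visualizing target values and comparing datasets. DataPrep offers interactive visualizations with insight notes. D-Tale provides an interactive interface for filtering, sorting, and visualizing the data.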

Conclusion

We saw some of the libraries that can automate the task of EDA in a few lines of code. There are many more available. One must choose the library that best fits one's needs and preferences. Keep in mind that while automation tools can provide a quick overview, manual inspection and domain knowledge are still essential for a thorough understanding of the data.
