
How to speed up Pandas with cuDF?

Last Updated : 26 Jan, 2022

Pandas DataFrames in Python are extremely useful; they provide an easy and flexible way to work with data, along with a large number of built-in functions to handle, analyze, and process it. Pandas is fast enough for many tasks, but computationally intensive operations on large datasets can be slow, causing delays in data science and ML workflows. One reason is that Pandas runs on the CPU, which has relatively few cores (often on the order of 8), and most Pandas operations use only one of them. GPU acceleration of data science and machine learning workflows addresses this problem: a GPU has thousands of cores and can run many data-parallel operations much faster.

cuDF

cuDF (CUDA DataFrame) is a Python GPU DataFrame library that accelerates loading, processing, and manipulating large datasets, enabling users to perform compute-intensive operations quickly. cuDF is built on the Apache Arrow columnar memory layout, which is discussed later in this article.

To shift from CPU to GPU, i.e. from Pandas to cuDF, one doesn't need to learn a new library from scratch. cuDF provides a Pandas-like API, which makes the move quite simple for data scientists, analysts, and machine learning engineers. Just like Pandas, cuDF offers two data structures, Series and DataFrame, and most of the built-in functions are available in cuDF with the same syntax.
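To illustrate the shared syntax, here is a minimal sketch that uses cuDF when it is available and falls back to Pandas otherwise. The alias `xdf` and the toy data are invented for illustration; the calls shown (`DataFrame`, boolean filtering, `groupby`, `sum`, `mean`) are ones that both libraries are expected to support with the same spelling.

```python
# Try cuDF first; fall back to pandas so the sketch still runs without a GPU.
try:
    import cudf as xdf  # GPU DataFrame library
    backend = "cudf"
except Exception:  # cuDF not installed, or no usable GPU
    import pandas as xdf  # CPU fallback with the same call syntax
    backend = "pandas"

# Toy data invented for illustration.
df = xdf.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "sales": [10, 20, 30, 40],
})

# The same DataFrame/Series calls are used on either backend.
high = df[df["sales"] > 15]                         # boolean filtering
total_by_city = df.groupby("city")["sales"].sum()   # grouped aggregation
total_sales = int(df["sales"].sum())
mean_sales = float(df["sales"].mean())

print(backend, len(high), total_sales, mean_sales)
```

Only the import line differs between the two backends; the rest of the code is unchanged.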

CUDA/GPU requirements:

  • CUDA 11.0+
  • NVIDIA driver 450.80.02+
  • Pascal architecture or better (Compute Capability >=6.0)
  • Conda

cuDF can be installed with conda from the rapidsai channel:

# for CUDA 11.0
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
   cudf=21.08 python=3.7 cudatoolkit=11.0

# or, for CUDA 11.2
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
   cudf=21.08 python=3.7 cudatoolkit=11.2
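After installation, a quick check like the following can confirm that the package is visible to the current Python environment. This is only a sketch: the helper name `cudf_available` is hypothetical, and it checks that the module can be found, not that a compatible GPU or driver is present.

```python
import importlib.util


def cudf_available() -> bool:
    """Return True if a module named 'cudf' can be found in this environment."""
    return importlib.util.find_spec("cudf") is not None


if cudf_available():
    print("cuDF is installed in this environment")
else:
    print("cuDF was not found; check that the conda environment is activated")
```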

Comparison between computational times of Pandas and cuDF

To compare the two libraries, let us load a large dataset, Data.csv, comprising 887379 rows and 22 columns. First, we load it with Pandas and measure the time taken; then we load the same file with cuDF and compare the runtimes.

Using Pandas to load a Dataset:

# Loading the Dataset using Pandas Library (CPU Based)
import pandas as pd
import time
  
  
start = time.time()
df = pd.read_csv("Data.csv")
print("no. of rows in the dataset", df.shape[0])
print("no. of columns in the dataset", df.shape[1])
end = time.time()
print("CPU time= ", end-start)


Output:

no. of rows in the dataset 887379
no. of columns in the dataset 22
CPU time=  2.3006720542907715

The output above shows the time taken to load Data.csv with Pandas.

Using cuDF to load a Dataset:

# Loading the Dataset using cuDF Library (GPU Based)
import cudf
import time
  
start = time.time()
df = cudf.read_csv("../input/data-big/Data.csv")
print("no. of rows in the dataset", df.shape[0])
print("no. of columns in the dataset", df.shape[1])
end = time.time()
print("GPU time= ", end-start)


Output:

no. of rows in the dataset 887379
no. of columns in the dataset 22
GPU time=  0.1478710174560547

The output above shows the time taken to load Data.csv with cuDF.

From the above two cases, it can be seen that the CPU (Pandas) takes about 2.30 seconds to load the dataset, while the GPU (cuDF) takes only about 0.15 seconds, which is roughly 15 times faster in this run.
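The speedup can be computed directly from the two reported timings. The sketch below also includes a small helper, `best_time` (a hypothetical name, not part of either library), that repeats a load several times and keeps the fastest run, since a single measurement can be distorted by disk caching or one-time GPU initialization.

```python
import time


def best_time(load, repeats=3):
    """Call load() several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        load()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Timings reported in the outputs above.
cpu_time = 2.3006720542907715
gpu_time = 0.1478710174560547
speedup = cpu_time / gpu_time
print(f"cuDF was about {speedup:.1f}x faster in this run")

# Stand-in workload so the helper can be tried without the dataset.
sample = best_time(lambda: sum(range(100_000)))
print(f"best of 3 runs: {sample:.6f} s")
```

To time the real loads, one could pass `lambda: pd.read_csv("Data.csv")` or the corresponding `cudf.read_csv` call to `best_time`, assuming the file is available at that path.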

Arrow Columnar Layout in cuDF

As stated earlier, cuDF employs the Apache Arrow columnar layout, an in-memory columnar format used to represent structured datasets. This columnar format is fast and allows computationally intensive operations to work efficiently when handling and iterating over big datasets.

Figure: A sample dataset in a traditional memory buffer vs. an Arrow memory buffer (columnar layout)

In a traditional memory buffer, the data is stored in contiguous memory locations row-wise. In contrast, in an Arrow memory buffer, the data is stored in contiguous memory locations column-wise, so all values of one column sit next to each other. This is one of the factors that contributes to the speed of cuDF DataFrames.
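The difference between the two layouts can be sketched with plain Python lists. This is a simplified illustration with invented data, not how Arrow is implemented: real Arrow keeps each column in its own typed buffer with additional metadata such as validity bitmaps.

```python
# A tiny table with three rows and three columns (id, label, score).
rows = [
    (1, "a", 1.5),
    (2, "b", 2.5),
    (3, "c", 3.5),
]
n_rows, n_cols = len(rows), len(rows[0])

# Row-wise layout: the values of each row are stored next to each other.
row_buffer = [value for row in rows for value in row]

# Column-wise layout: the values of each column are stored next to each other.
columns = list(zip(*rows))
column_buffer = [value for column in columns for value in column]

# Reading the "id" column:
ids_from_rows = row_buffer[0::n_cols]   # strided access, skips over other columns
ids_from_cols = column_buffer[:n_rows]  # one contiguous slice

print(row_buffer)
print(column_buffer)
print(ids_from_rows, ids_from_cols)
```

In the column-wise buffer, an operation over a single column reads one contiguous block, which suits the parallel memory access pattern of GPUs.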

Note: Since cuDF requires a RAPIDS-compatible GPU, for practice or exploration one can use Kaggle or Google Colaboratory, as both platforms provide free GPU access. However, when using Google Colab, make sure that you have been allocated one of the following GPUs: Tesla T4, P4, or P100, as these are the only RAPIDS-compatible GPUs on Google Colab.
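One way to see which GPU has been allocated is to query `nvidia-smi` from Python. The sketch below assumes the `nvidia-smi` tool is on the PATH when a GPU is attached; the helper name `gpu_names` is hypothetical, and it returns an empty list when no GPU or driver is found.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print(names if names else "No NVIDIA GPU detected")
```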

Thus, cuDF lets us apply GPU acceleration to Python DataFrames and make data processing considerably faster. This is significant in the fields of data science and ML, where data is generated in overwhelming quantities every second and its speedy processing is imperative.


