Open In App

Bypassing Pandas Memory Limitations

Improve
Improve
Like Article
Like
Save
Share
Report

Pandas is a Python library used for analyzing and manipulating data sets but one of the major drawbacks of Pandas is memory limitation issues while working with large datasets since Pandas DataFrames (two-dimensional data structure) are kept in memory, there is a limit to how much data can be processed at a time.

Dataset in use: train_dataset

Processing large amounts of data in Pandas requires one of the below approaches:

Method 1: Load data in chunks

pandas.read_csv() has a parameter called chunksize which is used to load data in chunks. The parameter chunksize is the number of rows read at a time in a file by Pandas. It returns an iterator TextFileReader which needs to be iterated to get the data.

Syntax:

    pd.read_csv(‘file_name’, chunksize= size_of_chunk)

Example:

Python3




import pandas as pd
  
data = pd.read_csv('train dataset.csv', chunksize=100)
  
for x in data:
    print(x.shape[0])


Output:

Method 2: Filter out useful data 

Large datasets have many columns/features but only some of them are actually used. So to save more time for data manipulation and computation, load only useful columns.

Syntax:

dataframe = dataframe[[‘column_1’, ‘column_2’, ‘column_3’, ‘column_4’, ‘column_5’]]

Example : 

Python3




import pandas as pd
  
data=pd.read_csv('train_dataset.csv')
  
data = data[['Gender', 'Age', 'openness', 'neuroticism',
             'conscientiousness', 'agreeableness', 'extraversion']]
  
display(data)


Output : 

Method 3: Specify dtypes for columns 

By default, pandas assigns int64 range(which is the largest available dtype) for all numeric values. But if the values in the numeric column are less than int64 range, then lesser capacity dtypes can be used to prevent extra memory allocation as larger dtypes use more memory.

Syntax:

  dataframe =pd.read_csv(‘file_name’,dtype={‘col_1’:‘dtype_value’,‘col_2’:‘dtype_value’})

Example : 

Python3




import pandas as pd
  
data = pd.read_csv('train_dataset.csv', dtype={'Age': 'int32'})
  
print(data.info())


Output :

Method 4: Sparse data structures 

Pandas Dataframe can be converted to Sparse Dataframe which means that any data matching a specific value is omitted in the representation. The sparse DataFrame allows for more efficient storage.

Syntax:

    dataframe = dataFrame.to_sparse(fill_value=None, kind=’block’)

Since there are no null values in the above dataset, let’s create dataframe with some null values and convert it to a sparse dataframe.

Example :

Python3




import pandas as pd
import numpy as np
  
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan
  
sdf = df.astype(pd.SparseDtype("float", np.nan))
  
sdf.head()
  
sdf.dtypes


Output:

Method 5: Delete unused objects 

While data cleaning/pre-processing many temporary data frames and objects are created which should be deleted after their use so that less memory is used. The del keyword in python is primarily used to delete objects in Python.

Syntax:

    del object_name

Example : 

Python3




import pandas as pd
  
data = pd.read_csv('train_dataset.csv')
  
del data




Last Updated : 30 Apr, 2021
Like Article
Save Article
Previous
Next
Share your thoughts in the comments
Similar Reads