
Handling Large Datasets in Python

Last Updated : 08 Apr, 2024

Handling large datasets is a common task in data analysis and manipulation. When working with large datasets, it is important to use efficient techniques and tools to ensure good performance and avoid memory issues. In this article, we will see how to handle large datasets in Python.

Handle Large Datasets in Python

To handle large datasets in Python, we can use the following techniques:

Reduce Memory Usage by Optimizing Data Types

By default, Pandas assigns data types that may not be memory-efficient. For numeric columns, consider downcasting to smaller types (e.g., int32 instead of int64, float32 instead of float64). For example, if a column only holds values from 0 to 9, int8 (8 bits) is sufficient instead of int64 (64 bits). Similarly, converting object columns to the category dtype can also save memory; a short sketch of this appears after the example output below.

Python3
import pandas as pd

# Define the size of the dataset
num_rows = 1000000  # 1 million rows

# Example DataFrame with inefficient datatypes
data = {'A': [1, 2, 3, 4],
        'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)

# Replicate the DataFrame to create a larger dataset
df_large = pd.concat([df] * (num_rows // len(df)), ignore_index=True)

# Check memory usage before conversion
print("Memory usage before conversion:")
print(df_large.memory_usage().sum())

# Downcast to more memory-efficient datatypes
df_large['A'] = pd.to_numeric(df_large['A'], downcast='integer')
df_large['B'] = pd.to_numeric(df_large['B'], downcast='float')

# Explicit casting with astype() is another option, e.g.:
# df_large['A'] = df_large['A'].astype('int32')
# df_large['B'] = df_large['B'].astype('float32')

# Check memory usage after conversion
print("Memory usage after conversion:")
print(df_large.memory_usage().sum())

# Print the resulting data types
print("\nResulting dtypes:")
print("Column 'A' dtype:", df_large['A'].dtype)
print("Column 'B' dtype:", df_large['B'].dtype)

Output

Memory usage before conversion:
16000128
Memory usage after conversion:
5000128

Resulting dtypes:
Column 'A' dtype: int8
Column 'B' dtype: float32
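The example above covers numeric columns. For object (string) columns with only a few distinct values, the category dtype mentioned earlier can cut memory dramatically. Below is a minimal, illustrative sketch; the 'city' column and its values are made up for this example.

Python3
import pandas as pd

# A string column with only a few distinct values (illustrative data)
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Chennai', 'Delhi'] * 250000})

# Memory usage while stored as plain Python objects
print("Memory as object:", df['city'].memory_usage(deep=True))

# Convert the object column to the category dtype
df['city'] = df['city'].astype('category')

# Each distinct string is now stored once, plus small integer codes
print("Memory as category:", df['city'].memory_usage(deep=True))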

Split Data into Chunks

Use the chunksize parameter in pd.read_csv() to read a dataset in smaller chunks and process each chunk iteratively, so the entire dataset is never loaded into memory at once. The example below simulates this on an in-memory DataFrame by grouping rows into fixed-size blocks; a read_csv-based sketch follows the example output.

Python3
import pandas as pd

# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}

df = pd.DataFrame(data)

# Process data in fixed-size chunks by grouping on the integer index
chunk_size = 1000
for chunk in df.groupby(df.index // chunk_size):
    print(chunk)

Output

(0,        A    B
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
995  995  995
996  996  996
997  997  997
998  998  998
999  999  999

[1000 rows x 2 columns])
(1,          A     B
1000  1000  1000
1001  1001  1001
1002  1002  1002
1003  1003  1003
1004  1004  1004
...    ...   ...
1995  1995  1995
1996  1996  1996
1997  1997  1997
1998  1998  1998
1999  1999  1999

[1000 rows x 2 columns])
(2,          A     B
2000  2000  2000
2001  2001  2001
2002  2002  2002
2003  2003  2003
2004  2004  2004
...    ...   ...
2995  2995  2995
2996  2996  2996
2997  2997  2997
2998  2998  2998
2999  2999  2999

[1000 rows x 2 columns])
(3,          A     B
3000  3000  3000
3001  3001  3001
3002  3002  3002
3003  3003  3003
3004  3004  3004
...    ...   ...
3995  3995  3995
3996  3996  3996
3997  3997  3997
3998  3998  3998
3999  3999  3999

[1000 rows x 2 columns])
(4,          A     B
4000  4000  4000
4001  4001  4001
4002  4002  4002
4003  4003  4003
4004  4004  4004
...    ...   ...
4995  4995  4995
4996  4996  4996
4997  4997  4997
4998  4998  4998
4999  4999  4999

[1000 rows x 2 columns])
(5,          A     B
5000  5000  5000
5001  5001  5001
5002  5002  5002
5003  5003  5003
5004  5004  5004
...    ...   ...
5995  5995  5995
5996  5996  5996
5997  5997  5997
5998  5998  5998
5999  5999  5999

[1000 rows x 2 columns])
(6,          A     B
6000  6000  6000
6001  6001  6001
6002  6002  6002
6003  6003  6003
6004  6004  6004
...    ...   ...
6995  6995  6995
6996  6996  6996
6997  6997  6997
6998  6998  6998
6999  6999  6999

[1000 rows x 2 columns])
(7,          A     B
7000  7000  7000
7001  7001  7001
7002  7002  7002
7003  7003  7003
7004  7004  7004
...    ...   ...
7995  7995  7995
7996  7996  7996
7997  7997  7997
7998  7998  7998
7999  7999  7999

[1000 rows x 2 columns])
(8,          A     B
8000  8000  8000
8001  8001  8001
8002  8002  8002
8003  8003  8003
8004  8004  8004
...    ...   ...
8995  8995  8995
8996  8996  8996
8997  8997  8997
8998  8998  8998
8999  8999  8999

[1000 rows x 2 columns])
(9,          A     B
9000  9000  9000
9001  9001  9001
9002  9002  9002
9003  9003  9003
9004  9004  9004
...    ...   ...
9995  9995  9995
9996  9996  9996
9997  9997  9997
9998  9998  9998
9999  9999  9999

[1000 rows x 2 columns])
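When the data actually lives on disk, the same pattern is expressed with the chunksize parameter of pd.read_csv(), which returns an iterator of DataFrames instead of loading the whole file. Below is a minimal sketch; the file name large_data.csv is a placeholder written by the script itself so the example is self-contained.

Python3
import pandas as pd

# Write a sample CSV file so the example can run on its own
pd.DataFrame({'A': range(10000), 'B': range(10000)}).to_csv('large_data.csv', index=False)

# Read and process the file 1000 rows at a time
chunk_size = 1000
for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    # Each chunk is a regular DataFrame with up to chunk_size rows
    print(chunk.shape)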

Use Dask for Parallel Computing

Dask is a parallel computing library that scales Pandas workflows to larger-than-memory datasets by splitting a DataFrame into partitions and evaluating operations lazily and in parallel across them. The example below parallelizes a groupby on an in-memory DataFrame; a sketch of reading a file directly with Dask follows the output.

Python3
import dask.dataframe as dd
import pandas as pd

# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}

df = pd.DataFrame(data)

# Load data using Dask
ddf = dd.from_pandas(df, npartitions=4)

# Perform parallelized operations
result = ddf.groupby('A').mean().compute()
print(result)

Output

           B
A           
0        0.0
1        1.0
2        2.0
3        3.0
4        4.0
...      ...
9995  9995.0
9996  9996.0
9997  9997.0
9998  9998.0
9999  9999.0

[10000 rows x 1 columns]
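For data that is genuinely larger than memory, Dask can also read files directly and load partitions lazily as computations need them, rather than starting from an in-memory Pandas DataFrame. A minimal sketch, assuming the large_data.csv file created in the chunking example above:

Python3
import dask.dataframe as dd

# Lazily point Dask at the CSV; nothing is read into memory yet
ddf = dd.read_csv('large_data.csv')

# Build the computation graph, then execute it in parallel
result = ddf['B'].mean().compute()
print(result)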

Conclusion

In conclusion, handling large datasets in Python comes down to optimizing data types to reduce memory usage, processing data in chunks instead of loading everything at once, and using parallel and lazy computation with libraries such as Dask. These steps help to efficiently process and analyze large datasets for data analysis and manipulation.


