
Tips to Avoid Memory Errors with Very Large Datasets

A NumPy MemoryError occurs when the library cannot allocate enough memory to perform a requested operation. This can happen for several reasons, such as insufficient physical RAM, inefficient memory management, or processing excessively large datasets; in short, the error is raised when an operation asks for more memory than the system can provide.
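As a quick illustration of what this looks like in practice, the following sketch deliberately requests far more memory than a typical machine has; the exact threshold and error message depend on your system, so the numbers here are only illustrative.

import numpy as np

try:
    # Deliberately request far more memory than most machines have
    # (10**12 float64 values is roughly 8 TB); the size is only illustrative.
    huge_array = np.zeros(10**12, dtype=np.float64)
except MemoryError as err:
    print("MemoryError:", err)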


To resolve a Memory Error raised by large NumPy arrays, consider the following approaches:

Optimize Array Creation:

Instead of creating a large array at once, consider creating it incrementally or using generator expressions to conserve memory.

The code below creates a large NumPy array incrementally by appending smaller chunks of random numbers. Be aware that this approach can be inefficient for large arrays, because np.append() creates a new array on every call, which leads to significant memory overhead and slow performance from repeated reallocation.

import numpy as np

# Example: Create a large array incrementally
size = 10000000  # 10 million elements
increment = 10000
large_array = np.array([], dtype=np.float64)
for i in range(0, size, increment):
    chunk = np.random.rand(increment)
    large_array = np.append(large_array, chunk)
print("Array created successfully with size:", large_array.size)

Output:

Array created successfully with size: 10000000
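
If the final size is known in advance, a more memory-friendly pattern is to preallocate the array once and fill it in place, slice by slice, which avoids the repeated copies that np.append() performs. The sketch below reuses the same size and increment values as the example above.

import numpy as np

size = 10000000   # 10 million elements
increment = 10000

# Preallocate the full array once, then fill it in place chunk by chunk;
# no intermediate copies are made, unlike with np.append().
large_array = np.empty(size, dtype=np.float64)
for start in range(0, size, increment):
    large_array[start:start + increment] = np.random.rand(increment)
print("Array created successfully with size:", large_array.size)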

Use Chunking:

Break down large arrays into smaller chunks and process them iteratively to avoid memory overload.

Processing a large array in smaller chunks, as shown below, is more memory-efficient and can also facilitate parallel or distributed processing, especially for datasets too large to fit entirely in memory. It lets you work on manageable portions of the data at a time instead of loading the whole dataset at once.

import numpy as np

# Example: Process large array in chunks
large_array = np.random.rand(10000000)  # Large array of 10 million elements
chunk_size = 10000
num_chunks = len(large_array) // chunk_size
for i in range(num_chunks):
    chunk = large_array[i * chunk_size: (i + 1) * chunk_size]
    # Process chunk here
    print("Processed chunk", i)

Output:

Processed chunk 0
Processed chunk 1
Processed chunk 2
Processed chunk 3
...
Processed chunk 994
Processed chunk 995
Processed chunk 996
Processed chunk 997
Processed chunk 998
Processed chunk 999
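
To make the "# Process chunk here" step concrete, the sketch below accumulates a running sum and element count per chunk and then derives the overall mean, so no operation ever needs the whole array at once. The variable names are illustrative.

import numpy as np

large_array = np.random.rand(10000000)  # 10 million elements
chunk_size = 10000

total = 0.0
count = 0
for start in range(0, len(large_array), chunk_size):
    chunk = large_array[start:start + chunk_size]
    total += chunk.sum()   # aggregate each chunk separately
    count += chunk.size
print("Mean computed chunk by chunk:", total / count)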

Free Memory:

Release memory occupied by unused variables or objects using the del keyword or by setting them to None.

import numpy as np

# Example: Free memory occupied by unused variables
large_array = np.random.rand(100000000)  # Large array of 100 million elements
# Process large_array
del large_array  # Free memory
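
Note that del only removes the name binding; the memory is returned once no other references to the array remain. In long-running scripts you can additionally ask Python's garbage collector to run, which mainly helps when arrays are caught in reference cycles. A minimal sketch:

import gc
import numpy as np

large_array = np.random.rand(100000000)  # about 800 MB of float64 data
# ... process large_array ...
del large_array  # drop the reference so the memory can be reclaimed
gc.collect()     # explicitly run garbage collection (helps with reference cycles)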

Utilize Virtual Memory:

Memory-map large arrays to disk with numpy.memmap so that only the portions of the data you actually access are loaded into memory.

import numpy as np

# Example: Memory-mapped array backed by a file on disk
filename = 'large_array.dat'
size = 100000000  # 100 million float32 values (about 400 MB on disk)
large_array_mmapped = np.memmap(filename, dtype='float32', mode='w+', shape=(size,))
# Fill the memory-mapped file in chunks so the full array never has to sit in RAM
chunk_size = 1000000
for start in range(0, size, chunk_size):
    large_array_mmapped[start:start + chunk_size] = np.random.rand(chunk_size)
large_array_mmapped.flush()  # make sure the data is written to disk
del large_array_mmapped  # release the in-memory view
large_array_mmapped = np.memmap(filename, dtype='float32', mode='r', shape=(size,))
print(large_array_mmapped[:10])  # reads only the requested slice from disk

Output:

[0.709748   0.99464947 0.3146733  0.8145548  0.87799954 0.29239368 0.36480942 0.8335829  0.7952584  0.34854943]
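
Memory mapping also works for existing .npy files created with np.save: passing mmap_mode='r' to np.load returns an array backed by the file on disk, so only the slices you actually index are read into RAM. The file name below is only an example.

import numpy as np

np.save('big_data.npy', np.random.rand(1000000))  # write a sample .npy file
data = np.load('big_data.npy', mmap_mode='r')     # memory-mapped; not fully loaded
print(data[:5])                                   # only this slice is read from disk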