How to Load a Massive File as small chunks in Pandas?
Pandas is a flexible, easy-to-use, open-source data analysis tool built on top of Python that makes it easy to import and visualize data in different formats like .csv, .tsv, .txt, and even .db files.
For the examples below we will consider only .csv files, but the process is similar for other file types. The method used to read CSV files is read_csv(). Its relevant parameters are:
filepath_or_buffer : str — Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.
iterator : bool, default False — Return TextFileReader object for iteration or getting chunks with get_chunk().
chunksize : int, optional — Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
The read_csv() method has many parameters, but the one we are interested in is chunksize. Technically, chunksize refers to the number of rows pandas reads from the file at a time. For example, if the chunksize is 100, pandas will load the first 100 rows. The object returned is not a DataFrame but a TextFileReader, which needs to be iterated to get the data.
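A minimal sketch of this behavior, using an in-memory CSV via io.StringIO in place of a real file (the data here is made up for illustration):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a large file (250 data rows)
csv_data = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(250))

# With chunksize set, read_csv returns a TextFileReader, not a DataFrame
reader = pd.read_csv(io.StringIO(csv_data), chunksize=100)

# Iterating the reader yields one DataFrame per chunk
for chunk in reader:
    print(type(chunk).__name__, len(chunk))  # each chunk has up to 100 rows
```

The same pattern works unchanged with a file path instead of the StringIO buffer.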
Example 1: Loading a massive amount of data normally.
In the program below we are going to use the toxicity classification dataset, which has more than 10000 rows. This is not much, but it will suffice for our example.
First, let's load the dataset and check the number of columns. This dataset has 8 columns.
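A sketch of this step, using a tiny in-memory stand-in with the same 8 columns (the column names are taken from the Jigsaw toxic-comment dataset and should be treated as assumptions):

```python
import io
import pandas as pd

# Stand-in CSV with the same 8 columns as the toxicity dataset
cols = ["id", "comment_text", "toxic", "severe_toxic",
        "obscene", "threat", "insult", "identity_hate"]
csv_data = ",".join(cols) + "\n" + "1,some text,0,0,0,0,0,0\n"

# Loading normally returns a single DataFrame
df = pd.read_csv(io.StringIO(csv_data))
print(df.columns.tolist())
print(len(df.columns))  # 8
```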
Let’s get more insights about the type of data and number of rows in the dataset.
We have a total of 159571 non-null rows.
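The usual way to get this information is DataFrame.info(), which prints the dtype and non-null count of every column. A sketch with toy data (the real dataset reports 159571 non-null rows per column):

```python
import io
import pandas as pd

# Toy data standing in for the full dataset
csv_data = "id,comment_text,toxic\n1,hello,0\n2,bye,1\n"
df = pd.read_csv(io.StringIO(csv_data))

df.info()        # prints dtypes and non-null counts per column
print(len(df))   # number of rows
```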
Example 2: Loading a massive amount of data using the chunksize argument.
Here we are creating chunks of size 10000 by passing the chunksize parameter. The object returned is not a DataFrame but an iterator; to get the data we will need to iterate through this object.
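A sketch of processing the file chunk by chunk, again with an in-memory CSV and a small chunksize so the example stays quick (the aggregation shown is illustrative):

```python
import io
import pandas as pd

# 30 data rows; the real example uses chunksize=10000 on a file path
csv_data = "value\n" + "\n".join(str(i) for i in range(30))

total = 0
# Each iteration yields a regular DataFrame, so normal pandas code works per chunk
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=10):
    total += chunk["value"].sum()

print(total)  # same result as summing the whole file at once
```

Because only one chunk is in memory at a time, this pattern lets you aggregate files that are too large to load whole.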
Now, let's calculate the number of chunks:
In the above example, each element/chunk returned has a size of 10000. Remember we had 159571 rows. Hence, 159571/10000 gives 15 full chunks, and the remaining 9571 rows form the 16th chunk.
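The same arithmetic can be checked on a scaled-down sketch: 35 rows with a chunksize of 10 should give three full chunks plus a partial fourth (data here is made up):

```python
import io
import pandas as pd

# 35 data rows, chunksize 10 -> expect chunks of 10, 10, 10, 5
csv_data = "x\n" + "\n".join(str(i) for i in range(35))

chunks = list(pd.read_csv(io.StringIO(csv_data), chunksize=10))
print(len(chunks))               # number of chunks
print([len(c) for c in chunks])  # rows per chunk
```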
The number of columns in each chunk is 8, so chunking doesn't affect the columns. Now that we understand how to use chunksize and obtain the data, let's have a last look at the data; for visibility purposes, the chunk size is set to 10.
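For a quick look at a single chunk without looping, the TextFileReader's get_chunk() method can be used; a sketch with toy data and a chunk size of 10:

```python
import io
import pandas as pd

# Toy CSV with 23 data rows (contents are illustrative)
csv_data = "id,text\n" + "\n".join(f"{i},row{i}" for i in range(23))

reader = pd.read_csv(io.StringIO(csv_data), chunksize=10)
first = reader.get_chunk()  # pulls the next chunk (10 rows) as a DataFrame
print(first)
```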