How to Fix Python Pandas Error Tokenizing Data

Last Updated : 13 Oct, 2023

Pandas is a Python library for data analysis. One of the most common ways to load data into Pandas is from a CSV file, but the file must follow a consistent format; otherwise, read_csv() can fail with an "Error tokenizing data" message. In this article, we will discuss several ways to fix the Python Pandas Error Tokenizing Data.

What is Python Pandas Error Tokenizing Data?

The “Python Pandas Error Tokenizing Data” typically occurs when you use the pandas.read_csv() function to read a CSV file and the parser cannot split a line into the expected number of fields. Tokenization is the process of splitting the data into smaller units (tokens) based on a delimiter, which for CSV files is usually a comma. The error is raised as a pandas.errors.ParserError, and the message usually reports the mismatch, for example: "Error tokenizing data. C error: Expected 3 fields in line 5, saw 4".
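To see the error in isolation, here is a minimal sketch that reproduces it. The data is a made-up in-memory CSV (not the file used later in this article) whose last row has one field more than the header:

```python
import io

import pandas as pd

# Hypothetical data: the header has 3 fields, but the last row has 4
# because of an extra comma-separated value.
raw = "name,age,city\nAsha,21,Pune\nRavi,22,Delhi\nMeera,23,Goa,extra\n"

try:
    pd.read_csv(io.StringIO(raw))
    message = None
except pd.errors.ParserError as exc:
    # The parser gives up on the first row it cannot tokenize.
    message = str(exc)

print(message)
```

The printed message names the expected field count and the offending line, which is usually the quickest pointer to the row that needs attention.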

Fixing Python Pandas Error Tokenizing Data

Check the CSV file
Specify the delimiter
Use the correct encoding
Skip rows with errors
Fix unbalanced quotes

Check the CSV File

Since Pandas reads the data straight from the CSV file, the first step is to check whether the file itself is malformed. Open it in Excel or any text editor and look for rows with extra or missing delimiters, stray quotes, or header lines that do not match the data. If you find a problem, correct it, save the file, and read it again.

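For larger files, checking by eye is not practical. The following is a rough sketch, using only Python's built-in csv module, that reports rows whose field count differs from the most common one. The helper name find_irregular_rows and the sample data are illustrative assumptions, not part of Pandas:

```python
import csv
import io
from collections import Counter


def find_irregular_rows(lines, delimiter=","):
    """Return (expected_field_count, [(line_number, field_count), ...])."""
    reader = csv.reader(lines, delimiter=delimiter)
    # reader.line_num is the physical line on which each record ended.
    rows = [(reader.line_num, row) for row in reader]
    counts = Counter(len(row) for _, row in rows if row)
    if not counts:
        return 0, []
    # Assume the most frequent field count is the "correct" one.
    expected = counts.most_common(1)[0][0]
    bad = [(num, len(row)) for num, row in rows if row and len(row) != expected]
    return expected, bad


# Hypothetical sample: line 3 has too few fields, line 4 too many.
sample = io.StringIO("name,age,city\nAsha,21,Pune\nRavi,22\nMeera,23,Goa,extra\n")
expected, bad = find_irregular_rows(sample)
print(expected, bad)
```

For a file on disk, you could pass an open file object instead, for example `open('student_data1.csv', newline='')`.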

Specify the Delimiter

By default, read_csv() assumes that fields are separated by a comma ( , ). If your CSV file uses a different delimiter, you must pass it through the sep parameter; otherwise, Pandas will either read every row as a single column or raise the error tokenizing data.

Example: In this example, the values in the CSV file are separated by semicolons, so we specify the semicolon ( ; ) as the delimiter while reading the file:

Python3

import pandas as pd
# The file uses ';' between fields, so override the default ','
df = pd.read_csv('student_data1.csv', sep=';')
df

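If you do not know which delimiter a file uses, one option is to let Python's csv.Sniffer guess it from a sample of the text. This is only a sketch on made-up in-memory data; the candidate delimiter list is an assumption, and the sniffer can guess wrong on unusual files:

```python
import csv
import io

import pandas as pd

# Hypothetical semicolon-separated data.
raw = "name;age;city\nAsha;21;Pune\nRavi;22;Delhi\n"

# Restrict the guess to a few plausible delimiters.
dialect = csv.Sniffer().sniff(raw, delimiters=";,\t|")

# Pass the detected delimiter on to Pandas.
df = pd.read_csv(io.StringIO(raw), sep=dialect.delimiter)
print(repr(dialect.delimiter), list(df.columns))
```

Pandas can also do this detection itself if you call read_csv() with sep=None and engine='python', at the cost of using the slower Python parsing engine.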

Use the Correct Encoding

By default, read_csv() decodes the CSV file as utf-8. If the file contains special characters and was saved in a different encoding, Pandas may raise a UnicodeDecodeError, show garbled characters, or fail to parse the file. In that case, pass the encoding the file was actually saved in through the encoding parameter.

Example: In this example, the CSV file contains special characters and is assumed to have been saved as Latin-1, so we pass encoding='latin-1'. Replace this with the real encoding of your file; note that 'ascii' only works for files with no special characters at all.

Python3

import pandas as pd
# 'latin-1' is an assumption here; use the encoding your file was saved in
df = pd.read_csv('student_data1.csv', encoding='latin-1')
df

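When the encoding is unknown, a common workaround is to try a short list of likely encodings in order. The sketch below assumes a hypothetical helper read_csv_with_fallback and a temporary file written just for the demonstration; the list of encodings is also an assumption and should be adapted to where your data comes from:

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in turn; return (DataFrame, encoding_that_worked)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError(f"none of {encodings} could decode {path}")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file saved in Latin-1, which is not valid UTF-8.
    path = os.path.join(tmp, "students_latin1.csv")
    with open(path, "wb") as f:
        f.write("name,city\nJosé,Málaga\n".encode("latin-1"))
    df, used = read_csv_with_fallback(path)

print(used)
```

Keep in mind that latin-1 can decode any byte sequence, so it never fails; if it is the encoding that "works", check the result for garbled characters.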

Skip Rows with Errors

By default, read_csv() tries to read every row and raises the error tokenizing data as soon as it meets a row with too many fields. If you know that some rows in your data are malformed and you can afford to lose them, you can tell Pandas to skip those rows with the on_bad_lines parameter. This parameter is available in pandas 1.3 and later; older versions used error_bad_lines=False instead. Rows with too few fields are not treated as bad lines; their missing values are filled with NaN.

Example: In this example, the CSV file contains some malformed rows, so we skip them while reading the file:

Python3

import pandas as pd
# Silently drop rows that have more fields than expected
df = pd.read_csv('student_data1.csv', on_bad_lines='skip')
df

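Skipping rows silently can hide data loss. In pandas 1.4 and later, on_bad_lines also accepts a function when engine='python' is used, which makes it possible to keep a record of what was dropped. The following is a sketch on made-up data; the function name remember_and_skip is an assumption:

```python
import io

import pandas as pd

# Hypothetical data: the second data row has one field too many.
raw = "name,age,city\nAsha,21,Pune\nRavi,22,Delhi,extra\nMeera,23,Goa\n"

skipped = []


def remember_and_skip(bad_line):
    # bad_line is the offending row, already split into a list of strings.
    skipped.append(bad_line)
    return None  # returning None tells Pandas to drop the row


df = pd.read_csv(io.StringIO(raw), on_bad_lines=remember_and_skip, engine="python")
print(len(df), skipped)
```

After reading, the skipped list can be inspected or written to a log, so you know exactly which rows were left out.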

Fix Unbalanced Quotes

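Sometimes the CSV file contains unbalanced quotes, for example a field that opens with a double quote that is never closed. The parser then treats everything up to the next quote (or the end of the file) as a single field, which can trigger the error tokenizing data. One way to get past this is to pass quoting=csv.QUOTE_NONE, which tells read_csv() to treat quote characters as ordinary text. Note that this does not repair the file: the quote characters stay in the data, and quoted fields that contain the delimiter will be split.

Example: In this example, the CSV file contains some unbalanced double quotes, so we turn off quote handling while reading the file: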

Python3

import pandas as pd
import csv
# QUOTE_NONE: treat double quotes as ordinary characters, not field wrappers
df = pd.read_csv('student_data1.csv', quoting=csv.QUOTE_NONE, quotechar='"')
df

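The effect of QUOTE_NONE is easier to see on a small example. The sketch below uses made-up in-memory data with one unclosed double quote; it first shows that the default settings fail, then reads the same data with quoting disabled and strips the leftover quote characters:

```python
import csv
import io

import pandas as pd

# Hypothetical data: the quote before "good" is never closed.
raw = 'name,comment\nAsha,"good\nRavi,fine\n'

try:
    pd.read_csv(io.StringIO(raw))
    default_failed = False
except pd.errors.ParserError:
    # The parser reaches the end of the data while still inside the quote.
    default_failed = True

# With quoting disabled, the quote is kept as a literal character.
df = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)

# Optional clean-up: remove stray quotes from the affected column.
cleaned = df["comment"].str.strip('"')
print(default_failed, list(df["comment"]), list(cleaned))
```

The clean-up step is a design choice: it is safe here because no legitimate value starts or ends with a double quote, which may not hold for your data.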

Conclusion:

Reading a malformed CSV file in Pandas can raise the error tokenizing data. Checking the file, specifying the correct delimiter and encoding, skipping bad rows, and handling unbalanced quotes, as described in this article, will help you resolve the error and parse the CSV file correctly.
