Character Encoding Detection With Chardet in Python

Last Updated : 21 Mar, 2024

Text reaches a program from many sources, such as text files, byte strings of unknown origin, and website content, and its character encoding is not always declared. In this article, we will see how to detect the character encoding of such data with the Chardet library in Python.

Example:

Input: data = b'\xff\xfe\x41\x00\x42\x00\x43\x00'
Output: UTF-16
Explanation: The data begins with the bytes \xff\xfe, which form a UTF-16 byte order mark (BOM), so Chardet reports the encoding as UTF-16.
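As a quick sanity check on the example above, the same bytes can be inspected and decoded with Python's standard library alone. This is a minimal sketch; the variable names `has_utf16_le_bom` and `text` are illustrative and do not come from the article.

```python
import codecs

data = b'\xff\xfe\x41\x00\x42\x00\x43\x00'

# The first two bytes match the UTF-16 little-endian byte order mark (BOM)
has_utf16_le_bom = data.startswith(codecs.BOM_UTF16_LE)

# The 'utf-16' codec reads the BOM to choose the byte order, then drops it
text = data.decode('utf-16')

print(has_utf16_le_bom)
print(text)
```

The remaining six bytes are the letters A, B, and C, each stored as two bytes with the low byte first.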

Character Encoding Detection With Chardet in Python

The examples below show how to detect the character encoding of a byte string, a web page, and a text file with Chardet in Python:

Installing Chardet in Python

First, install Chardet with the following command. The examples that follow assume it is available:

pip install chardet

Example 1: Detecting Encoding of a Byte String

In this example, the Python script uses the chardet library to detect the character encoding of a given byte sequence (data). chardet.detect() returns a dictionary, and the script prints only its 'encoding' entry. The same dictionary also holds a 'confidence' value between 0 and 1.

Python3
import chardet

# Byte string with unknown encoding
data = b'\xff\xfe\x41\x00\x42\x00\x43\x00'

# Detect the encoding
result = chardet.detect(data)
print(result['encoding'])

Output:

UTF-16
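The dictionary returned by chardet.detect() carries more than the encoding name. The sketch below reads the confidence as well and then uses the detected name to decode the bytes. It assumes the same data as Example 1; the None check is there because detection is a guess and may fail to produce an encoding.

```python
import chardet

data = b'\xff\xfe\x41\x00\x42\x00\x43\x00'

result = chardet.detect(data)
encoding = result['encoding']      # name of the guessed encoding, or None
confidence = result['confidence']  # float between 0.0 and 1.0

print(f"encoding={encoding}, confidence={confidence}")

# Decode only when Chardet actually produced a guess
text = None
if encoding is not None:
    text = data.decode(encoding)
    print(text)
```

Because the result is a statistical guess, a low confidence value is a hint to verify the decoded text before relying on it.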

Example 2: Detecting Encoding of a Website Content

In this example, the Python script uses the requests library to fetch the HTML of the GeeksforGeeks home page. response.content holds the raw, undecoded bytes of the response body, which is what chardet.detect() expects. The script then prints the detected encoding of the page.

Python3
import requests
import chardet

# Fetch the web page content
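response = requests.get('https://www.geeksforgeeks.org/', timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html_content = response.content  # raw bytes, not decoded text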

# Detect the encoding
result = chardet.detect(html_content)
print(result['encoding'])

Output:

utf-8
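Web servers often declare a charset themselves, so detection is most useful as a fallback. The following is a minimal sketch of that idea, not part of Chardet or requests: `decode_body` and its `declared` parameter are hypothetical names, and the sample bytes stand in for a real response body so the code runs without network access.

```python
from typing import Optional

import chardet


def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes, preferring a declared charset over a Chardet guess."""
    if declared:
        try:
            return raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # declared charset is unknown or wrong; fall back to detection
    guess = chardet.detect(raw)['encoding']
    # If Chardet has no guess, assume UTF-8 and replace undecodable bytes
    return raw.decode(guess or 'utf-8', errors='replace')


# Stand-ins for response bodies (assumed sample data)
declared_sample = 'héllo wörld'.encode('utf-8')
bom_sample = b'\xff\xfe\x41\x00\x42\x00\x43\x00'

print(decode_body(declared_sample, declared='utf-8'))
print(decode_body(bom_sample))
```

With requests, response.content could be passed as raw, with the charset from the response headers (if any) passed as declared.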

Example 3: Detecting Encoding of a Text File

In this example, the Python script reads the content of a text file (‘utf-8.txt’) in binary mode by calling open() with the mode 'rb', so the bytes reach Chardet undecoded. The chardet library then detects the character encoding of the file’s content, and the script prints it. Note that Chardet reports utf-8 only when the file contains some non-ASCII characters; a file made up purely of ASCII characters is reported as ascii.

utf-8.txt

utf-8

Python3
import chardet

# Read the text file as raw bytes
with open('utf-8.txt', 'rb') as f:
    data = f.read()

# Detect the encoding
result = chardet.detect(data)
print(result['encoding'])

Output:

utf-8
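Reading a whole file into memory is fine for small inputs. For large files, Chardet also provides an incremental UniversalDetector class that can be fed chunk by chunk and can stop early once it is sure. The sketch below assumes a hypothetical helper name (`detect_file_encoding`), an arbitrary chunk size, and a temporary sample file created only to keep the example self-contained.

```python
import os
import tempfile

from chardet.universaldetector import UniversalDetector


def detect_file_encoding(path, chunk_size=4096):
    """Feed a file to Chardet in chunks and return its result dictionary."""
    detector = UniversalDetector()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps reading until read() returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            detector.feed(chunk)
            if detector.done:  # the detector is already confident
                break
    detector.close()  # finalizes the guess
    return detector.result


# Write a small UTF-16 sample (with BOM) to a temporary file for the demo
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(b'\xff\xfe\x41\x00\x42\x00\x43\x00')
    sample_path = tmp.name

file_result = detect_file_encoding(sample_path)
print(file_result['encoding'])

os.remove(sample_path)
```

The returned dictionary has the same keys as the one from chardet.detect(), so the rest of the script does not need to change.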
