UnicodeDecodeError: ‘utf-8’ codec can’t decode byte


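The error UnicodeDecodeError: 'utf-8' codec can't decode byte occurs when Python tries to decode a file or byte string as UTF-8, but the data contains byte sequences that are not valid UTF-8. This usually means the data was written in a different encoding or is not text at all.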
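To see the error in isolation, the following minimal sketch encodes a string as Latin-1 and then decodes the resulting bytes as UTF-8. The sample string is an assumption chosen for illustration; the `start` and `object` attributes are standard on `UnicodeDecodeError`.

```python
# "café" in Latin-1 is b'caf\xe9'; the lone 0xE9 byte is not valid UTF-8.
raw = "café".encode("latin-1")

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; 'exc' is cleared after the except block

# The exception records which byte failed and where it sits in the data.
bad_byte = error.object[error.start]
print(f"byte 0x{bad_byte:02x} at position {error.start}")
```

The position reported in the message is a byte offset, which helps when you need to inspect the offending part of a file in a hex viewer.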


1. Common Causes and Fixes

Cause 1: File Encoding Mismatch

If a file is encoded in ISO-8859-1, Windows-1252, or another encoding but is read as UTF-8, Python raises a UnicodeDecodeError as soon as it meets a byte sequence that is invalid in UTF-8.

Incorrect Code:

with open("data.txt", "r", encoding="utf-8") as file:
    content = file.read()

Problem: The file is not actually UTF-8 encoded, but Python is forcing UTF-8.

Solution: Open the file with the encoding it was actually saved in:

with open("data.txt", "r", encoding="latin-1") as file:  # latin-1 is ISO-8859-1; use "cp1252" for Windows-1252
    content = file.read()
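When the encoding is not known in advance, one option is to try a short list of candidates in order. The sketch below is illustrative only: the helper name `read_text_with_fallback` and the candidate list are assumptions, not part of any library. Note that latin-1 accepts every possible byte, so it never fails and must come last; a "successful" latin-1 decode can still produce the wrong characters.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()  # read bytes once, then try each candidate on them
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes; try the next one
    raise ValueError(f"None of {encodings} could decode {path!r}")
```

Calling `text, used = read_text_with_fallback("data.txt")` returns the decoded text together with the encoding that worked, so you can log it or reuse it later.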

Cause 2: Non-UTF-8 Characters in a UTF-8 String

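Symbols such as £ and © are stored as single bytes (0xA3 and 0xA9) in Latin-1 and Windows-1252. Those bytes on their own are not valid UTF-8, so a file that is mostly UTF-8 but contains a few of them fails to decode.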

Solution: Use errors="ignore" to skip problematic characters:

with open("data.txt", "r", encoding="utf-8", errors="ignore") as file:
content = file.read()

Or use errors="replace" to replace unreadable characters with ?:

with open("data.txt", "r", encoding="utf-8", errors="replace") as file:
content = file.read()
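The difference between the two error handlers is easiest to see on a small byte string. This is a minimal sketch with made-up sample data; it also shows that decoding with the right encoding recovers the character that both handlers lose.

```python
# 0xA3 is the pound sign in Latin-1, but an invalid byte in UTF-8.
raw = b"Price: \xa3100"

ignored = raw.decode("utf-8", errors="ignore")    # the bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD
correct = raw.decode("latin-1")                   # the real encoding keeps the symbol

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Both handlers discard information, so they are best kept for cases where the exact characters do not matter.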

Cause 3: Reading a Binary File as Text

Trying to read a binary file (e.g., images, PDFs, ZIPs) as a text file can cause this error.

Incorrect Code:

with open("image.jpg", "r", encoding="utf-8") as file:
    content = file.read()

Problem: JPG files are binary, not text.

Solution: Open the file in binary mode (rb):

with open("image.jpg", "rb") as file:
    content = file.read()
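If a program receives arbitrary files, a quick check before opening them as text can avoid the error. The helper below is a rough heuristic, not a standard API: the name `looks_binary` and the 1024-byte sample size are assumptions. It treats a file as binary when its first bytes contain a NUL byte, which also flags UTF-16 text, so treat the result as a hint only.

```python
def looks_binary(path, sample_size=1024):
    """Heuristic: report a file as binary if its first bytes contain a NUL byte."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)  # only inspect the start of the file
    return b"\x00" in sample
```

A caller could use it as `mode = "rb" if looks_binary(path) else "r"` before deciding how to read the file.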

Cause 4: Copy-Pasting Non-UTF-8 Characters into a Python String

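Text copied from the web or a word processor often contains typographic characters such as en dashes and curly quotes. If that text was saved or transmitted as Windows-1252, each of those characters is a single byte (for example, 0x96 for an en dash) that is invalid in UTF-8. Note that a Python 3 str is already Unicode, so the error only appears when the raw bytes are decoded.

Solution: Decode the bytes with the encoding they were written in:

raw = b"Some text with special character: \x96"  # 0x96 is an en dash in Windows-1252
fixed_text = raw.decode("cp1252")
print(fixed_text)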
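Once such bytes are decoded with their real encoding, the resulting string can be re-encoded as UTF-8 for storage or transmission. A minimal sketch, assuming the source bytes are Windows-1252 (the sample bytes are invented for illustration):

```python
# Bytes as they might arrive from a Windows-1252 source:
# 0x93 and 0x94 are curly double quotes, 0x96 is an en dash.
raw = b"\x93quoted\x94 \x96 dash"

text = raw.decode("cp1252")        # now an ordinary Python str
utf8_bytes = text.encode("utf-8")  # safe to write or send as UTF-8

print(text)
```

The key point is the order: decode with the encoding the data came in, then encode with the encoding you want.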

Cause 5: Using str() on Non-UTF-8 Data

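If a byte string (b'some data') contains bytes that are not valid UTF-8, converting it with str(data, "utf-8") or data.decode("utf-8") raises this error. Note that str(data) with no encoding does not fail; it returns the printable representation "b'...'", which is rarely what you want.

Incorrect Code:

data = b"\xff\xfeHello"  # Byte string starting with two bytes that are invalid in UTF-8
text = str(data, "utf-8")  # Raises UnicodeDecodeError: 0xff is not a valid start byte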

Solution: Decode the byte string properly:

text = data.decode("utf-8", errors="ignore")
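The behaviours involved can be compared side by side. This is a small sketch using the same sample bytes; the UTF-16 part rests on the assumption that a leading 0xFF 0xFE is a byte-order mark, which is common but not guaranteed for arbitrary data.

```python
data = b"\xff\xfeHello"

as_repr = str(data)                                # no error, but yields the repr text
ignored = data.decode("utf-8", errors="ignore")    # drops the two invalid bytes
replaced = data.decode("utf-8", errors="replace")  # marks each invalid byte with U+FFFD

# If the data really is UTF-16 with a little-endian byte-order mark,
# decoding it as UTF-16 recovers the text without losing anything.
utf16_data = b"\xff\xfe" + "Hello".encode("utf-16-le")
decoded = utf16_data.decode("utf-16")

print(as_repr, ignored, decoded)
```

When the leading bytes look like a byte-order mark, decoding with the matching encoding is usually preferable to ignoring them.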

Cause 6: Reading a CSV File with Non-UTF-8 Encoding

CSV files may be saved in Windows-1252, ISO-8859-1, or Shift-JIS instead of UTF-8.

Incorrect Code:

import pandas as pd

df = pd.read_csv("data.csv", encoding="utf-8") # Fails if file is not UTF-8

Solution: Pandas does not detect the encoding on its own, so pass the file's actual encoding explicitly:

df = pd.read_csv("data.csv", encoding="latin-1")  # Try different encodings

Or use chardet to guess the encoding from a sample of the file:

import chardet

with open("data.csv", "rb") as f:
    result = chardet.detect(f.read(10000))  # Read first 10,000 bytes
encoding = result["encoding"]

df = pd.read_csv("data.csv", encoding=encoding)
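The "try different encodings" advice can be automated with a small loop. The sketch below assumes pandas is installed; the helper name `read_csv_with_fallback` and the candidate list are illustrative, not part of pandas. As with plain files, latin-1 never raises, so it goes last and may yield wrong characters rather than an error.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (DataFrame, encoding) for the first encoding that reads cleanly."""
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError:
            continue  # wrong guess; move on to the next candidate
    raise ValueError(f"None of {encodings} could decode {path!r}")
```

It is worth checking a few non-ASCII values in the resulting DataFrame, since a clean read does not prove the guess was right.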

2. Detecting File Encoding Before Reading

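To reduce errors, use chardet to guess the file encoding before reading. The result is a statistical estimate based on a sample of the file, so it can be wrong, especially for short files or for closely related encodings such as ISO-8859-1 and Windows-1252.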

Detect Encoding and Read File Safely:

import chardet

def detect_encoding(file_path):
    with open(file_path, "rb") as f:
        result = chardet.detect(f.read(10000))  # Analyze first 10,000 bytes
    return result["encoding"]

file_path = "data.txt"
encoding = detect_encoding(file_path)

with open(file_path, "r", encoding=encoding, errors="replace") as file:
    content = file.read()
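chardet.detect() returns a dictionary that includes "encoding" and "confidence" keys, and the encoding can be None when no guess is possible. Passing None to open() falls back to the platform default, which may not be what you intend. The helper below is a hedged sketch of one way to handle this; the name `choose_encoding`, the UTF-8 default, and the 0.5 confidence threshold are all assumptions.

```python
def choose_encoding(result, default="utf-8", min_confidence=0.5):
    """Pick an encoding from a chardet-style result dict, with a fallback."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Fall back when chardet has no guess or is not confident enough.
    if encoding is None or confidence < min_confidence:
        return default
    return encoding
```

For example, `encoding = choose_encoding(chardet.detect(sample))` can replace a direct `result["encoding"]` lookup.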

3. Summary of Fixes

Issue | Fix
File encoding mismatch | Use encoding="latin-1" or encoding="windows-1252"
Non-UTF-8 bytes in a UTF-8 file | Use errors="ignore" or errors="replace"
Reading binary files as text | Open in binary mode (rb)
Copy-pasted non-UTF-8 characters | Manually decode the bytes with the correct encoding
Using str() on byte data | Use .decode("utf-8", errors="ignore")
CSV file with different encoding | Use pandas.read_csv("data.csv", encoding="latin-1")
Detecting unknown encoding | Use chardet.detect()
