UnicodeDecodeError: 'utf-8' codec can't decode byte

The UnicodeDecodeError: 'utf-8' codec can't decode byte occurs when Python decodes bytes (from a file or a bytes object) as UTF-8 and meets a byte sequence that is not valid UTF-8.
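
To make the message concrete, the sketch below triggers the error on a small hand-made byte string and inspects the exception. The sample bytes are an assumption chosen for illustration (the word "café" encoded as Latin-1); `start`, `object`, and `reason` are standard attributes of UnicodeDecodeError.

```python
# "café" encoded as Latin-1: the é becomes the single byte 0xe9,
# which is not a complete UTF-8 sequence on its own.
raw = "café".encode("latin-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_byte = exc.object[exc.start]  # the offending byte, as an int
    position = exc.start              # index of that byte in the input
    reason = exc.reason               # short explanation from the codec
    print(f"byte 0x{bad_byte:02x} at position {position}: {reason}")
```

The position in the message is a byte offset, not a character or line number, so it is usually easiest to open the file in binary mode and look at the bytes around that offset.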

1. Common Causes and Fixes

Cause 1: File Encoding Mismatch

If a file is encoded in ISO-8859-1, Windows-1252, or another encoding but is read as UTF-8, Python raises a UnicodeDecodeError as soon as it meets a byte sequence that is not valid UTF-8.

Incorrect Code:

with open("data.txt", "r", encoding="utf-8") as file:
    content = file.read()

Problem: The file is not actually UTF-8 encoded, but the code tells Python to decode it as UTF-8.

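Solution: Open the file with the encoding it was actually written in. Note that `latin-1` maps every possible byte to a character, so it never raises, but it yields wrong characters if the file really uses another encoding:

with open("data.txt", "r", encoding="latin-1") as file:  # latin-1 is ISO-8859-1; use "cp1252" for Windows-1252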
    content = file.read()
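
When the encoding is not known in advance, one option is to try a short list of candidates in order. The following is a minimal sketch under stated assumptions: the helper name `read_text_with_fallback` and the candidate order are hypothetical, and the demo file is created on the fly.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo on a temporary file written in Windows-1252 (assumed sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Price: 10 €".encode("cp1252"))
    demo_path = tmp.name

text, used = read_text_with_fallback(demo_path)
os.remove(demo_path)
print(used, text)
```

UTF-8 goes first because it is strict and rarely accepts text written in another encoding. `latin-1` goes last because it accepts any byte, so it acts as a catch-all that may still produce the wrong characters.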

Cause 2: Non-UTF-8 Characters in a UTF-8 String

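Symbols such as €, £, and © are all representable in UTF-8. The error arises when a mostly UTF-8 file contains a few bytes written in another encoding, for example € stored as the single byte 0x80 from Windows-1252.

Solution: Use `errors="ignore"` to skip the undecodable bytes (this silently discards data):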

with open("data.txt", "r", encoding="utf-8", errors="ignore") as file:
    content = file.read()

Or use errors="replace" to replace unreadable characters with ?:

with open("data.txt", "r", encoding="utf-8", errors="replace") as file:
    content = file.read()
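
The difference between the two error handlers is easiest to see on a small byte string. This is an illustrative sketch; the sample bytes are an assumption, with 0x80 standing in for a euro sign written in Windows-1252.

```python
raw = b"Total: \x80 50"  # 0x80 is the euro sign in Windows-1252, invalid as UTF-8

ignored = raw.decode("utf-8", errors="ignore")    # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # inserts U+FFFD in its place
correct = raw.decode("cp1252")                    # decodes it as intended

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Both handlers lose the original character, so they are best kept for cases where the correct encoding cannot be determined.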

Cause 3: Reading a Binary File as Text

Trying to read a binary file (e.g., images, PDFs, ZIPs) as a text file can cause this error.

Incorrect Code:

with open("image.jpg", "r", encoding="utf-8") as file:
    content = file.read()

Problem: JPG files are binary, not text. Their very first byte (0xff) is already invalid in UTF-8.

Solution: Open the file in binary mode (rb):

with open("image.jpg", "rb") as file:
    content = file.read()
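
If a script receives arbitrary files, a rough check before choosing the mode can help. The sketch below is a heuristic only: the helper name `looks_binary` is hypothetical, and it treats a file as binary when its first bytes contain a NUL byte. That misclassifies UTF-16 text, which also contains NUL bytes.

```python
import os
import tempfile

def looks_binary(path, sample_size=1024):
    """Heuristic: report binary if the first bytes contain a NUL byte."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return b"\x00" in sample

def _write_temp(data):
    # Helper for the demo: write bytes to a temporary file, return its path.
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(data)
        return tmp.name

# Hand-made samples: the start of a JPEG/JFIF header, and plain UTF-8 text.
jpeg_path = _write_temp(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00")
text_path = _write_temp("plain text".encode("utf-8"))

jpeg_is_binary = looks_binary(jpeg_path)
text_is_binary = looks_binary(text_path)

os.remove(jpeg_path)
os.remove(text_path)
print(jpeg_is_binary, text_is_binary)
```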

Cause 4: Copy-Pasting Non-UTF-8 Characters into a Python String

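Text pasted from the web or a word processor often originates in Windows-1252, where characters such as the en dash are stored as single bytes (0x96). A Python 3 string is already Unicode, so the literal itself never raises UnicodeDecodeError; the error appears when the underlying bytes are decoded as UTF-8. If those bytes were decoded as latin-1 instead, the string is left with a stray control character such as \x96.

Solution: Recover the original bytes, then decode them with the encoding they were written in:

text = "Some text with special character: \x96"  # U+0096 left behind by a latin-1 decode
fixed_text = text.encode("latin-1").decode("cp1252")  # 0x96 is an en dash in Windows-1252
print(fixed_text)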

Cause 5: Using `str()` on Non-UTF-8 Data

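If a bytes object (b'some data') contains bytes that are not valid UTF-8, converting it with str(data, "utf-8") or data.decode() raises this error. Calling str(data) with no encoding does not raise in Python 3; it returns the printable representation "b'...'", which is rarely what you want.

Incorrect Code:

data = b"\xff\xfeHello"  # 0xff and 0xfe never appear in valid UTF-8
text = str(data, "utf-8")  # Raises UnicodeDecodeError: invalid start byte

Solution: Decode the bytes with their actual encoding, or drop the undecodable bytes if losing them is acceptable:

text = data.decode("utf-8", errors="ignore")  # Result: "Hello"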
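
The bytes 0xff 0xfe at the start of data are often a UTF-16 byte order mark (BOM). Checking for a BOM before decoding can sometimes avoid guessing. The following is a sketch under assumptions: the helper `decode_with_bom` and its table are hypothetical, and the demo builds its own well-formed UTF-16 sample.

```python
import codecs

# Map a leading BOM to a codec that consumes it. UTF-32 is checked before
# UTF-16 because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the codec implied by a BOM, else the default."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    return data.decode(default)

sample = codecs.BOM_UTF16_LE + "Hello".encode("utf-16-le")
text = decode_with_bom(sample)
print(text)
```

Data without a BOM still falls through to the default codec, so this complements the other fixes; it does not replace them.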

Cause 6: Reading a CSV File with Non-UTF-8 Encoding

CSV files may be saved in Windows-1252, ISO-8859-1, or Shift-JIS instead of UTF-8.

Incorrect Code:

import pandas as pd

df = pd.read_csv("data.csv", encoding="utf-8")  # Fails if file is not UTF-8

Solution: Pandas does not detect the encoding on its own, so pass the file's actual encoding explicitly:

df = pd.read_csv("data.csv", encoding="latin-1")  # latin-1 never raises; check that the text looks right

Or use the third-party chardet package to guess the encoding (the result is a statistical guess and can be wrong or None):

import chardet

with open("data.csv", "rb") as f:
    result = chardet.detect(f.read(10000))  # Read first 10,000 bytes
    encoding = result["encoding"]

df = pd.read_csv("data.csv", encoding=encoding)
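
The same idea applies without Pandas. As an illustrative sketch using only the standard library, the example below writes a small Windows-1252 CSV to a temporary file (assumed sample data) and reads it back with the `csv` module, passing the encoding to `open()`.

```python
import csv
import os
import tempfile

# Write a small CSV in Windows-1252 (assumed sample data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="cp1252", newline="") as tmp:
    csv.writer(tmp).writerows([["name", "price"], ["Café crème", "3 €"]])
    csv_path = tmp.name

# The csv module documentation recommends newline="" for file objects.
with open(csv_path, "r", encoding="cp1252", newline="") as f:
    rows = list(csv.reader(f))

os.remove(csv_path)
print(rows)
```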

2. Detecting File Encoding Before Reading

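To reduce the chance of errors, use chardet to guess the file encoding before opening it. Detection is heuristic, so treat the result as a best guess and not as a guarantee.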

Detect Encoding and Read File Safely:

import chardet

def detect_encoding(file_path):
    with open(file_path, "rb") as f:
        result = chardet.detect(f.read(10000))  # Analyze first 10,000 bytes
    return result["encoding"]

file_path = "data.txt"
encoding = detect_encoding(file_path)

with open(file_path, "r", encoding=encoding, errors="replace") as file:
    content = file.read()
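
Because detection can return no encoding or a low-confidence guess, it may help to validate the result before using it. The sketch below works on a dictionary shaped like the output of `chardet.detect()` (keys `encoding` and `confidence`). The helper name `choose_encoding`, the 0.5 threshold, and the UTF-8 fallback are assumptions to tune for your data.

```python
def choose_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Pick the detected encoding only if it exists and is confident enough."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        return fallback
    return encoding

# Hand-written dictionaries standing in for detector output.
confident = choose_encoding({"encoding": "Windows-1252", "confidence": 0.73})
missing = choose_encoding({"encoding": None, "confidence": 0.0})
unsure = choose_encoding({"encoding": "ascii", "confidence": 0.2})
print(confident, missing, unsure)
```

In the function above, this would be called as `choose_encoding(result)` in place of returning `result["encoding"]` directly.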

3. Summary of Fixes

Issue	Fix
File encoding mismatch	Use the file's actual encoding, e.g. `encoding="latin-1"` or `encoding="windows-1252"`
Invalid bytes in a mostly UTF-8 file	Use `errors="ignore"` or `errors="replace"` (both lose data)
Reading binary files as text	Open in binary mode (`rb`)
Copy-pasted characters from another encoding	Manually encode/decode the string
Decoding byte data with `str()`	Use `.decode()` with the correct encoding, or `.decode("utf-8", errors="ignore")`
CSV file with different encoding	Use `pandas.read_csv("data.csv", encoding="latin-1")`
Detecting unknown encoding	Guess it with `chardet.detect()`
