The UnicodeDecodeError: 'utf-8' codec can't decode byte error occurs when Python tries to decode bytes (from a file or a byte string) as UTF-8, but the data contains byte sequences that are not valid UTF-8.
1. Common Causes and Fixes
Cause 1: File Encoding Mismatch
If a file is encoded in ISO-8859-1, Windows-1252, or another encoding but is read as UTF-8, Python raises a UnicodeDecodeError.
Incorrect Code:
with open("data.txt", "r", encoding="utf-8") as file:
    content = file.read()
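Problem: The file is not actually UTF-8 encoded, but the code tells Python to decode it as UTF-8.
Solution: Find out the file's actual encoding and pass it explicitly:
with open("data.txt", "r", encoding="latin-1") as file:  # latin-1 is ISO-8859-1; use "cp1252" for Windows-1252
    content = file.read()
Note that latin-1 maps every possible byte to a character, so it never raises this error, but it produces garbled text if the file is really in a different encoding.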
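When the encoding is not known in advance, one option is to try a short list of candidates in order. The following is a minimal sketch, not a definitive solution: the helper name read_with_fallback and the candidate list are assumptions chosen for illustration, and latin-1 is placed last because it accepts any byte sequence.

```python
from pathlib import Path


def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) using the first codec that decodes cleanly."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

A codec that decodes without error is not necessarily the right one, so it is worth inspecting the returned text before trusting it.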
Cause 2: Non-UTF-8 Characters in a UTF-8 String
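Special symbols (e.g., €, £, ©) are all representable in UTF-8, but if they were saved as single bytes from a legacy encoding such as Windows-1252, those bytes are not valid UTF-8 sequences.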
Solution: Use errors="ignore"
to skip problematic characters:
with open("data.txt", "r", encoding="utf-8", errors="ignore") as file:
content = file.read()
Or use errors="replace"
to replace unreadable characters with ?
:
with open("data.txt", "r", encoding="utf-8", errors="replace") as file:
content = file.read()
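To make the difference between the two error handlers concrete, here is a small sketch on an in-memory byte string. The sample bytes are an assumption for illustration: 0x80 is the euro sign in Windows-1252 and is not valid on its own in UTF-8.

```python
raw = b"price: \x80 5"  # 0x80 = euro sign in Windows-1252, invalid alone in UTF-8

ignored = raw.decode("utf-8", errors="ignore")    # the bad byte is silently dropped
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD
recovered = raw.decode("cp1252")                  # the right codec keeps the symbol

print(repr(ignored))
print(repr(replaced))
print(repr(recovered))
```

Only the third call recovers the euro sign; the first two avoid the exception at the cost of losing the character.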
Cause 3: Reading a Binary File as Text
Trying to read a binary file (e.g., images, PDFs, ZIPs) as a text file can cause this error.
Incorrect Code:
with open("image.jpg", "r", encoding="utf-8") as file:
    content = file.read()
Problem: JPG files are binary, not text.
Solution: Open the file in binary mode (rb):
with open("image.jpg", "rb") as file:
    content = file.read()  # content is a bytes object, not a str
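If a script cannot know ahead of time whether a file is text or binary, a rough check on the first bytes can help choose the mode. This is only a heuristic sketch: the helper name looks_binary and the 1024-byte sample size are assumptions, and the NUL-byte test will misclassify UTF-16 or UTF-32 text as binary.

```python
def looks_binary(path, sample_size=1024):
    """Heuristic: treat a file as binary if its first bytes contain a NUL byte."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(sample_size)
```

A caller could then open the file with "rb" when looks_binary(path) is true and fall back to text mode otherwise.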
Cause 4: Copy-Pasting Non-UTF-8 Characters into a Python String
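Text copied from the web or a word processor often contains Windows-1252 characters such as the en dash (byte 0x96). Once that text is a Python 3 str it is already Unicode and cannot raise this error; the error appears when the same bytes arrive undecoded, for example from a file or from a source file saved in Windows-1252.
Solution: Decode the raw bytes with the encoding they were written in:
raw = b"Some text with special character: \x96"  # 0x96 is an en dash in Windows-1252
fixed_text = raw.decode("cp1252")
print(fixed_text)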
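Cause 5: Using str() on Non-UTF-8 Data
If a byte string (e.g., b'some data') contains bytes that are not valid UTF-8, converting it with str(data, "utf-8") or data.decode("utf-8") raises this error. Calling str(data) with no encoding does not raise; it returns the printable representation of the bytes object, which is rarely what you want.
Incorrect Code:
data = b"\xff\xfeHello"  # Byte string with bytes that are invalid in UTF-8
text = str(data, "utf-8")  # Raises UnicodeDecodeError: 0xff is an invalid start byte
Solution: Decode the byte string with an explicit error handler, or with its real encoding if known:
text = data.decode("utf-8", errors="ignore")  # "Hello"; the two invalid bytes are dropped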
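The three ways of converting a bytes object behave quite differently, which the following sketch illustrates. It assumes Python 3 run without the -b flag, and the variable names are chosen only for illustration.

```python
data = b"\xff\xfeHello"

# str() with no encoding never decodes; it returns the repr of the bytes object.
as_repr = str(data)

# str() with an encoding is a strict decode and raises on invalid bytes.
try:
    str(data, "utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# An explicit error handler drops the invalid bytes instead of raising.
cleaned = data.decode("utf-8", errors="ignore")

print(as_repr, strict_failed, cleaned)
```

Here as_repr still contains the b'...' wrapper and escape sequences, so it should not be mistaken for decoded text.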
Cause 6: Reading a CSV File with Non-UTF-8 Encoding
CSV files may be saved in Windows-1252, ISO-8859-1, or Shift-JIS instead of UTF-8.
Incorrect Code:
import pandas as pd
df = pd.read_csv("data.csv", encoding="utf-8") # Fails if file is not UTF-8
Solution: Pandas does not auto-detect the encoding, so pass the file's actual encoding explicitly:
df = pd.read_csv("data.csv", encoding="latin-1")  # Or "cp1252", "shift_jis", etc., depending on the file
Or use chardet to guess the encoding from a sample of the file:
import chardet
with open("data.csv", "rb") as f:
    result = chardet.detect(f.read(10000))  # Read first 10,000 bytes
encoding = result["encoding"]  # A statistical guess; may be None if detection fails
df = pd.read_csv("data.csv", encoding=encoding)
2. Detecting File Encoding Before Reading
To reduce errors, use chardet (a third-party package, installed with pip install chardet) to guess the file encoding. Detection is statistical, so treat the result as a best guess rather than a guarantee.
Detect Encoding and Read File Safely:
import chardet

def detect_encoding(file_path):
    with open(file_path, "rb") as f:
        result = chardet.detect(f.read(10000))  # Analyze first 10,000 bytes
    return result["encoding"]  # May be None if chardet cannot guess

file_path = "data.txt"
encoding = detect_encoding(file_path)
with open(file_path, "r", encoding=encoding, errors="replace") as file:
    content = file.read()
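Where adding a dependency is undesirable, a cheaper check using only the standard library is to look for a byte order mark and then attempt a strict UTF-8 decode. The sketch below is an assumption-laden alternative, not a replacement for chardet: the helper name sniff_encoding and the cp1252 fallback are illustrative choices, and it cannot tell legacy single-byte encodings apart.

```python
import codecs

# (BOM bytes, codec name) pairs. UTF-32 is checked before UTF-16 because the
# UTF-32 little-endian BOM begins with the UTF-16 little-endian BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_encoding(raw, fallback="cp1252"):
    """Guess a codec name for raw bytes: BOM first, then strict UTF-8, then fallback."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8")  # strict decode; raises if any byte is invalid
        return "utf-8"
    except UnicodeDecodeError:
        return fallback  # assumed default for legacy Western European text
```

The returned name can be passed directly as the encoding argument of open() or bytes.decode().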
3. Summary of Fixes
Issue | Fix |
---|---|
File encoding mismatch | Use encoding="latin-1" or encoding="windows-1252" |
Non-UTF-8 characters in UTF-8 file | Use errors="ignore" or errors="replace" |
Reading binary files as text | Open in binary mode (rb) |
Copy-pasted non-UTF-8 characters | Decode the original bytes with the correct codec (e.g., cp1252) |
Using str() on byte data | Use .decode("utf-8", errors="ignore") |
CSV file with different encoding | Use pandas.read_csv("data.csv", encoding="latin-1") |
Detecting unknown encoding | Use chardet.detect() |