Python Regular Expressions for Data Cleaning – A Detailed Guide
Regular Expressions (RegEx) are a powerful tool for pattern matching and text manipulation in Python. In data cleaning, RegEx helps to extract, replace, or validate text patterns efficiently.
1. What are Regular Expressions (RegEx)?
A regular expression (RegEx) is a sequence of characters defining a search pattern. It is widely used in text processing, data cleaning, validation, and parsing.
Python provides the re module for handling regular expressions.
2. Importing the re Module
To use regular expressions in Python, import the re module:
import re
3. Basic Regular Expression Functions
The re module provides several functions for pattern matching:
Function | Description |
---|---|
re.search() | Searches for a pattern in a string and returns the first match |
re.match() | Checks if the pattern is at the start of the string |
re.findall() | Returns all occurrences of the pattern in a string |
re.sub() | Replaces occurrences of a pattern |
re.split() | Splits a string based on a pattern |
re.compile() | Compiles a regular expression for reuse |
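As a rough illustration of how these functions differ, the sketch below runs each one against a made-up string. The sample text and variable names are assumptions chosen for the example, not part of any real dataset.

```python
import re

sample = "order 66 shipped, order 77 pending"

# re.search: first match anywhere in the string (or None if there is none)
first = re.search(r"\d+", sample)
first_number = first.group() if first else None

# re.match: succeeds only if the pattern matches at the very start
starts_with_digits = re.match(r"\d+", sample) is not None
starts_with_order = re.match(r"order", sample) is not None

# re.findall: list of all non-overlapping matches
all_numbers = re.findall(r"\d+", sample)

# re.sub: replace every match with the given text
masked = re.sub(r"\d+", "#", sample)

# re.split: split the string wherever the pattern matches
parts = re.split(r",\s*", sample)

# re.compile: build a pattern object once and reuse it
number_re = re.compile(r"\d+")
reused = number_re.findall("7 of 9")

print(first_number, starts_with_digits, starts_with_order)
print(all_numbers, masked, parts, reused)
```

Note that re.search() and re.match() return a match object (or None), while re.findall() returns plain strings.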
4. Understanding RegEx Patterns
Metacharacters in RegEx
Metacharacter | Description | Example |
---|---|---|
. | Any character except a newline | h.t → Matches “hat”, “hit” |
^ | Start of string | ^hello → Matches strings that start with “hello” |
$ | End of string | world$ → Matches strings that end with “world” |
* | Zero or more occurrences | ab* → Matches “a”, “ab”, “abb”, “abbb” |
+ | One or more occurrences | ab+ → Matches “ab”, “abb”, “abbb” |
? | Zero or one occurrence | ab? → Matches “a”, “ab” |
{n} | Exactly n occurrences | a{3} → Matches “aaa” |
{n,} | At least n occurrences | a{2,} → Matches “aa”, “aaa”, “aaaa” |
{n,m} | Between n and m occurrences | a{2,4} → Matches “aa”, “aaa”, “aaaa” |
[] | Matches one of the specified characters | [aeiou] → Matches “a”, “e”, “i”, “o”, “u” |
\d | Matches digits (0-9) | \d+ → Matches “123”, “45” |
\D | Matches non-digits | \D+ → Matches “abc” |
\s | Matches whitespace (space, tab, newline) | \s+ → Matches runs of spaces, tabs, newlines |
\S | Matches non-whitespace characters | \S+ → Matches “Hello” |
\w | Matches word characters (letters, digits, underscore) | \w+ → Matches “word_123” |
\W | Matches non-word characters | \W+ → Matches punctuation, spaces |
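A few of the table rows can be checked directly in Python. The sketch below uses a small hypothetical helper, matches(), built on re.fullmatch() so that the whole string has to fit the pattern; the helper name and test strings are assumptions for illustration.

```python
import re

# Hypothetical helper: True when the entire string matches the pattern.
def matches(pattern, s):
    return re.fullmatch(pattern, s) is not None

dot_ok = matches(r"h.t", "hat")          # '.' stands for any single character
star_ok = matches(r"ab*", "a")           # '*' allows zero 'b's
plus_ok = matches(r"ab+", "a")           # '+' needs at least one 'b'
range_ok = matches(r"a{2,4}", "aaaaa")   # five 'a's exceed the upper bound

vowels = re.findall(r"[aeiou]", "data")
words = re.findall(r"\w+", "word_123 and more")

print(dot_ok, star_ok, plus_ok, range_ok)  # True True False False
print(vowels)                              # ['a', 'a']
print(words)                               # ['word_123', 'and', 'more']
```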
5. Using Regular Expressions for Data Cleaning
Now, let’s explore how to use regular expressions for data cleaning.
5.1 Removing Unwanted Characters
Example: Removing Special Characters
If your data contains unwanted characters like punctuation, you can remove them using re.sub().
import re
text = "Hello!! How are you? I'm fine..."
clean_text = re.sub(r'[^\w\s]', '', text)  # Remove everything except word characters and whitespace
print(clean_text)
Output:
Hello How are you Im fine
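The output above also loses the apostrophe in "I'm". If that matters for your data, one possible variation is to add the apostrophe to the set of characters that are kept. The sketch also shows the re.ASCII flag, since \w matches accented letters by default in Python 3; the sample strings are assumptions for illustration.

```python
import re

text = "Hello!! How are you? I'm fine..."

# Keep apostrophes by adding ' to the negated character class
keep_apostrophes = re.sub(r"[^\w\s']", "", text)
print(keep_apostrophes)  # Hello How are you I'm fine

# \w is Unicode-aware by default, so accented letters survive...
accented = "caf\u00e9 au lait!"
unicode_clean = re.sub(r"[^\w\s]", "", accented)

# ...unless re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_clean = re.sub(r"[^\w\s]", "", accented, flags=re.ASCII)
print(unicode_clean)
print(ascii_clean)       # caf au lait
```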
5.2 Removing Extra Spaces
Sometimes, text contains multiple spaces. We can replace them with a single space.
text = "This   is  a   text   with    extra   spaces."
clean_text = re.sub(r'\s+', ' ', text).strip()
print(clean_text)
Output:
This is a text with extra spaces.
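Because \s also matches newlines, the substitution above flattens multi-line text into a single line. When line breaks should be kept, one option is to collapse only spaces and tabs, line by line. This is a minimal sketch; the sample text is an assumption.

```python
import re

text = "first   line \n\n second\t\tline  "

# \s+ treats newlines as whitespace, so everything ends up on one line
flattened = re.sub(r"\s+", " ", text).strip()

# Collapse only spaces and tabs within each line, then drop empty lines
lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
per_line = "\n".join(line for line in lines if line)

print(repr(flattened))  # 'first line second line'
print(repr(per_line))   # 'first line\nsecond line'
```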
5.3 Extracting Numbers from Text
To extract numbers from text, use re.findall().
text = "The price of the product is 499 dollars and the discount is 50 dollars."
numbers = re.findall(r'\d+', text)
print(numbers)
Output:
['499', '50']
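The pattern \d+ only finds whole numbers and returns them as strings. If the data may contain decimals, a possible extension is sketched below. It uses a non-capturing group, (?:...), because re.findall() returns only the group contents when a pattern contains a capturing group. The sample text is an assumption.

```python
import re

text = "Price: 499.99 dollars, discount 50, tax 7.5%"

# Non-capturing group: findall returns the whole match
numbers = re.findall(r"\d+(?:\.\d+)?", text)
print(numbers)  # ['499.99', '50', '7.5']

# With a capturing group, findall returns only what the group matched
group_only = re.findall(r"\d+(\.\d+)?", text)
print(group_only)  # ['.99', '', '.5']

# Convert the extracted strings to floats for numeric work
values = [float(n) for n in numbers]
print(values)  # [499.99, 50.0, 7.5]
```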
5.4 Extracting Emails
Emails in a dataset can be extracted using pattern matching.
text = "Contact us at support@example.com or sales@shop.com for inquiries."
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)
Output:
['support@example.com', 'sales@shop.com']
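When the same pattern is applied to many records, compiling it once with re.compile() keeps the code tidy. The sketch below also lowercases and de-duplicates the results, which is a common cleaning convention but an assumption here, since strictly speaking the part before the @ can be case-sensitive. The records list is made up for illustration.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Hypothetical records; in practice these might be rows from a file
records = [
    "Contact: Support@Example.com",
    "support@example.com, sales@shop.com",
    "no email here",
]

unique_emails = set()
for record in records:
    for email in EMAIL_RE.findall(record):
        unique_emails.add(email.lower())  # normalize case before de-duplicating

emails = sorted(unique_emails)
print(emails)  # ['sales@shop.com', 'support@example.com']
```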
5.5 Extracting Phone Numbers
We can extract phone numbers based on a pattern.
text = "Call us at +1-800-555-1234 or (123) 456-7890."
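# The country code sits in an optional non-capturing group, so numbers without one, such as (123) 456-7890, also match
phones = re.findall(r'(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)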
print(phones)
Output:
['+1-800-555-1234', '(123) 456-7890']
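Extracted phone numbers usually arrive in mixed formats, so a follow-up step is often to normalize them. The sketch below strips every non-digit and reformats the last ten digits with backreferences. The helper name normalize_phone() and the ten-digit assumption (North American style numbers) are illustrative choices, not a general solution.

```python
import re

def normalize_phone(raw):
    """Hypothetical helper: return 'XXX-XXX-XXXX' or None if too short."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) < 10:
        return None
    # Keep the last ten digits (drops a leading country code) and regroup them
    return re.sub(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", digits[-10:])

print(normalize_phone("+1-800-555-1234"))  # 800-555-1234
print(normalize_phone("(123) 456-7890"))   # 123-456-7890
print(normalize_phone("555-1234"))         # None
```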
5.6 Validating Emails
We can validate emails using re.match().
def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))
print(is_valid_email("user@example.com")) # True
print(is_valid_email("user@@example.com")) # False
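One subtlety with the ^...$ approach is that $ also matches just before a trailing newline, so a value read from a file with its line ending still attached can pass validation. A possible alternative is re.fullmatch(), which requires the entire string to match. The function names below are hypothetical and exist only to compare the two behaviours.

```python
import re

EMAIL_BODY = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def valid_with_anchors(email):
    # '$' matches at the end of the string or just before a final newline
    return bool(re.match("^" + EMAIL_BODY + "$", email))

def valid_with_fullmatch(email):
    # fullmatch requires the whole string, newline included, to fit the pattern
    return bool(re.fullmatch(EMAIL_BODY, email))

print(valid_with_anchors("user@example.com\n"))    # True
print(valid_with_fullmatch("user@example.com\n"))  # False
print(valid_with_fullmatch("user@example.com"))    # True
```

Either way, a pattern like this checks only the general shape of an address, not whether the mailbox exists.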
5.7 Replacing Text (Data Standardization)
We can rewrite abbreviations in a single, consistent form.
text = "The flight departs at 10 AM & arrives at 3 PM"  # no trailing period, which would otherwise produce "P.M.."
clean_text = re.sub(r'\bAM\b', 'A.M.', text)
clean_text = re.sub(r'\bPM\b', 'P.M.', clean_text)
print(clean_text)
Output:
The flight departs at 10 A.M. & arrives at 3 P.M.
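Real data tends to mix variants such as "AM", "pm" and "a.m.". One way to handle them in a single pass is to give re.sub() a function as the replacement. The sketch below is a rough illustration under some assumptions: it only rewrites the marker when it follows a digit (so the word "am" is left alone), and it absorbs an existing trailing period to avoid doubling it. The sample text and function name are made up.

```python
import re

# Digit, optional spaces, a/p, optional '.', m, optional trailing '.'
TIME_MARK = re.compile(r"(\d)\s*([ap])\.?m\b\.?", re.IGNORECASE)

def standardize(match):
    # Rebuild the text as '<digit> A.M.' or '<digit> P.M.'
    return f"{match.group(1)} {match.group(2).upper()}.M."

text = "Departs 10 AM & arrives 3pm. Boarding at 9 a.m. sharp, I am sure."
clean_text = TIME_MARK.sub(standardize, text)
print(clean_text)
# Departs 10 A.M. & arrives 3 P.M. Boarding at 9 A.M. sharp, I am sure.
```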
5.8 Extracting URLs from Text
If a dataset contains URLs, we can extract them.
text = "Visit our website at https://www.example.com or follow http://blog.example.org"
urls = re.findall(r'https?://[^\s]+', text)
print(urls)
Output:
['https://www.example.com', 'http://blog.example.org']
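Because the pattern takes everything up to the next whitespace, URLs followed directly by punctuation pick up the extra characters. A simple heuristic, sketched below, is to strip common trailing punctuation after matching. This is an assumption that suits ordinary prose; a URL that legitimately ends in ")" would be trimmed too. The sample text is made up.

```python
import re

text = "See https://www.example.com, or (http://blog.example.org/post)."

raw_urls = re.findall(r"https?://\S+", text)
print(raw_urls)
# ['https://www.example.com,', 'http://blog.example.org/post).']

# Heuristic cleanup: remove punctuation that most likely belongs to the sentence
urls = [u.rstrip(".,;:!?)") for u in raw_urls]
print(urls)
# ['https://www.example.com', 'http://blog.example.org/post']
```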
6. Practical Use Cases in Data Cleaning
Task | RegEx Used |
---|---|
Removing special characters | [^\w\s] |
Removing extra spaces | \s+ |
Extracting numbers | \d+ |
Extracting emails | [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} |
Extracting phone numbers | (?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} |
Extracting URLs | https?://[^\s]+ |
Validating email addresses | ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ |
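In practice several of these patterns are usually chained. The sketch below combines the first two rows of the table into one hypothetical helper, clean_text(), and applies it to a small made-up list of values; the names and sample rows are assumptions for illustration.

```python
import re

# Compile once, reuse for every row
SPECIAL_CHARS = re.compile(r"[^\w\s]")
EXTRA_SPACES = re.compile(r"\s+")

def clean_text(text):
    """Hypothetical helper: drop punctuation, then collapse whitespace."""
    text = SPECIAL_CHARS.sub("", text)
    return EXTRA_SPACES.sub(" ", text).strip()

rows = ["  Hello!!   World?? ", "Price:\t$499 "]
cleaned = [clean_text(row) for row in rows]
print(cleaned)  # ['Hello World', 'Price 499']
```

The order matters here: removing punctuation first can leave gaps behind, which the whitespace step then collapses.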