Python Regular Expressions for Data Cleaning – A Detailed Guide

Regular Expressions (RegEx) are a powerful tool for pattern matching and text manipulation in Python. In data cleaning, RegEx helps to extract, replace, or validate text patterns efficiently.

1. What are Regular Expressions (RegEx)?

A regular expression (RegEx) is a sequence of characters defining a search pattern. It is widely used in text processing, data cleaning, validation, and parsing.

Python provides the re module for handling regular expressions.

2. Importing the `re` Module

To use regular expressions in Python, import the re module:

import re

3. Basic Regular Expression Functions

The re module provides several functions for pattern matching:

Function	Description
`re.search()`	Scans the whole string and returns a match object for the first match, or `None` if there is none
`re.match()`	Checks whether the pattern matches at the start of the string
`re.findall()`	Returns a list of all non-overlapping matches of the pattern in a string
`re.sub()`	Replaces occurrences of a pattern
`re.split()`	Splits a string based on a pattern
`re.compile()`	Compiles a regular expression for reuse
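The functions in the table can be compared side by side on one string. The following is a minimal sketch; the sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text used only for this illustration
text = "Order 66 shipped on 2024-01-15; order 67 is pending."

# re.search: first match anywhere in the string (a match object, or None)
first = re.search(r"\d+", text)
first_number = first.group() if first else None      # '66'

# re.match: succeeds only if the pattern matches at position 0
starts_with_order = re.match(r"Order", text) is not None   # True
starts_with_digit = re.match(r"\d", text) is not None      # False

# re.findall: list of all non-overlapping matches
all_numbers = re.findall(r"\d+", text)   # ['66', '2024', '01', '15', '67']

# re.sub: replace every match of the pattern
masked = re.sub(r"\d", "#", text)

# re.split: split the string wherever the pattern matches
parts = re.split(r";\s*", text)

# re.compile: build the pattern once and reuse its methods
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
dates = date_pattern.findall(text)       # ['2024-01-15']
```

Compiling is mostly a readability and reuse choice: the compiled object exposes the same operations as methods (`search`, `match`, `findall`, `sub`, `split`).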

4. Understanding RegEx Patterns

Metacharacters in RegEx

Metacharacter	Description	Example
`.`	Any character except a newline	`h.t` → Matches “hat”, “hit”
`^`	Start of string	`^hello` → Matches strings that start with “hello”
`$`	End of string	`world$` → Matches strings that end with “world”
`*`	Zero or more occurrences	`ab*` → Matches “a”, “ab”, “abb”, “abbb”
`+`	One or more occurrences	`ab+` → Matches “ab”, “abb”, “abbb”
`?`	Zero or one occurrence	`ab?` → Matches “a”, “ab”
`{n}`	Exactly n occurrences	`a{3}` → Matches “aaa”
`{n,}`	At least n occurrences	`a{2,}` → Matches “aa”, “aaa”, “aaaa”
`{n,m}`	Between n and m occurrences	`a{2,4}` → Matches “aa”, “aaa”, “aaaa”
`[]`	Matches one of the specified characters	`[aeiou]` → Matches “a”, “e”, “i”, “o”, “u”
`\d`	Matches a digit (0-9; for text patterns also other Unicode decimal digits unless `re.ASCII` is set)	`\d+` → Matches “123”, “45”
`\D`	Matches non-digits	`\D+` → Matches “abc”
`\s`	Matches whitespace (space, tab, newline)	`\s+` → Matches spaces, tabs, newlines
`\S`	Matches non-whitespace characters	`\S+` → Matches “Hello”
`\w`	Matches word characters (letters, digits, underscore)	`\w+` → Matches “word_123”
`\W`	Matches non-word characters	`\W+` → Matches punctuation, spaces
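A few rows of the table can be checked directly in Python. This sketch uses `re.fullmatch()` so that a string only counts when the whole of it fits the pattern; the word lists are invented for the example.

```python
import re

# '.' stands for exactly one character, so "heat" and "ht" do not fit h.t
words = ["hat", "hit", "hot", "heat", "ht"]
three_letter = [w for w in words if re.fullmatch(r"h.t", w)]   # ['hat', 'hit', 'hot']

# Quantifiers applied to the preceding character
samples = ["a", "ab", "abbb", "b"]
star = [s for s in samples if re.fullmatch(r"ab*", s)]      # zero or more b
plus = [s for s in samples if re.fullmatch(r"ab+", s)]      # one or more b
optional = [s for s in samples if re.fullmatch(r"ab?", s)]  # zero or one b

# \w+ keeps letters, digits, and underscores together; punctuation splits tokens
tokens = re.findall(r"\w+", "word_123, next-item!")   # ['word_123', 'next', 'item']

# \s covers spaces, tabs, and newlines
whitespace = re.findall(r"\s", "a b\tc\nd")
```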

5. Using Regular Expressions for Data Cleaning

Now, let’s explore how to use regular expressions for data cleaning.

5.1 Removing Unwanted Characters

Example: Removing Special Characters

If your data contains unwanted characters like punctuation, you can remove them using re.sub().

import re

text = "Hello!! How are you? I'm fine..."
clean_text = re.sub(r'[^\w\s]', '', text)  # Remove everything except word characters (letters, digits, underscore) and whitespace
print(clean_text)

Output:

Hello How are you Im fine

5.2 Removing Extra Spaces

Sometimes, text contains multiple spaces. We can replace them with a single space.

text = "This   is    a   text   with   extra    spaces."
clean_text = re.sub(r'\s+', ' ', text).strip()
print(clean_text)

Output:

This is a text with extra spaces.

5.3 Extracting Numbers from Text

To extract numbers from text, use re.findall().

text = "The price of the product is 499 dollars and the discount is 50 dollars."
numbers = re.findall(r'\d+', text)
print(numbers)

Output:

['499', '50']
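The `\d+` pattern only captures runs of digits, so decimals and negative values are split or lose their sign. A possible extension, shown here as a sketch with an invented sentence, adds an optional minus sign and an optional fractional part:

```python
import re

# Hypothetical sample text with a decimal, a negative integer, and a fraction
text = "Temperature fell from 3.5 to -12 degrees; the sensor reported 0.75 volts."

# -?            optional leading minus sign
# \d+           integer part
# (?:\.\d+)?    optional fractional part (non-capturing, so findall returns whole matches)
number_pattern = re.compile(r"-?\d+(?:\.\d+)?")

raw = number_pattern.findall(text)      # ['3.5', '-12', '0.75']
values = [float(x) for x in raw]        # [3.5, -12.0, 0.75]
```

One caveat of this pattern: in a range written as "3-5", the hyphen would be read as a minus sign, so such data may need a stricter rule.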

5.4 Extracting Emails

Emails in a dataset can be extracted using pattern matching.

text = "Contact us at support@example.com or sales@shop.com for inquiries."
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)

Output:

['support@example.com', 'sales@shop.com']

5.5 Extracting Phone Numbers

We can extract phone numbers based on a pattern.

text = "Call us at +1-800-555-1234 or (123) 456-7890."
# The country code group is optional, so numbers like "(123) 456-7890" also match
phones = re.findall(r'(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)
print(phones)

Output:

['+1-800-555-1234', '(123) 456-7890']
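Extracted phone numbers usually still differ in punctuation. One common follow-up step, sketched here with a made-up list, is to strip every non-digit character with `\D` so the values can be compared or deduplicated:

```python
import re

# Hypothetical extracted values in three different formats
phones = ["+1-800-555-1234", "(123) 456-7890", "123.456.7890"]

# \D matches any non-digit, so only the digits remain
digits_only = [re.sub(r"\D", "", p) for p in phones]
# ['18005551234', '1234567890', '1234567890']
```

This discards the leading `+`, so whether a number carried a country code has to be inferred from its length or stored separately.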

5.6 Validating Emails

We can validate emails using re.match().

def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

print(is_valid_email("user@example.com"))  # True
print(is_valid_email("user@@example.com"))  # False
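A detail worth knowing when validating with `^...$`: by default `$` also matches just before a trailing newline, so a value such as "user@example.com\n" passes. A stricter variant, sketched below with the hypothetical names `EMAIL_PATTERN` and `is_valid_email_strict`, uses `fullmatch()` instead:

```python
import re

# Same email pattern as above, without the ^ and $ anchors
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email_strict(email):
    # fullmatch requires the entire string, including any trailing newline, to fit
    return EMAIL_PATTERN.fullmatch(email) is not None

# The anchored re.match version accepts a trailing newline
loose = bool(re.match(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
                      "user@example.com\n"))          # True

strict = is_valid_email_strict("user@example.com\n")  # False
```

Either way, this pattern is a format check only; it does not confirm that the address exists.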

5.7 Replacing Text (Data Standardization)

We can standardize how abbreviations are written, for example rewriting AM and PM as A.M. and P.M.

text = "The flight departs at 10 AM & arrives at 3 PM."
# \.? also consumes a period that directly follows, so "PM." does not become "P.M.."
clean_text = re.sub(r'\bAM\b\.?', 'A.M.', text)
clean_text = re.sub(r'\bPM\b\.?', 'P.M.', clean_text)
print(clean_text)

Output:

The flight departs at 10 A.M. & arrives at 3 P.M.
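When many abbreviations need standardizing, chaining one `re.sub()` call per term becomes hard to maintain. A possible alternative, sketched here, builds one pattern from a lookup table and passes a function as the replacement. The `REPLACEMENTS` mapping, the `expand` helper, and the address text are all hypothetical.

```python
import re

# Hypothetical lookup table of abbreviations and their full forms
REPLACEMENTS = {"St": "Street", "Ave": "Avenue", "Rd": "Road"}

# Build \b(St|Ave|Rd)\b\.? from the keys; re.escape guards against special characters
pattern = re.compile(r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b\.?")

def expand(match):
    # group(1) is the abbreviation without any trailing period
    return REPLACEMENTS[match.group(1)]

text = "Ship to 12 Main St. or 4 Oak Ave, not Elm Rd"
clean = pattern.sub(expand, text)
print(clean)  # Ship to 12 Main Street or 4 Oak Avenue, not Elm Road
```

The word boundaries keep the pattern from touching longer words that merely contain an abbreviation.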

5.8 Extracting URLs from Text

If a dataset contains URLs, we can extract them.

text = "Visit our website at https://www.example.com or follow http://blog.example.org"
urls = re.findall(r'https?://[^\s]+', text)
print(urls)

Output:

['https://www.example.com', 'http://blog.example.org']
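Because the pattern takes everything up to the next whitespace, a URL followed by a comma or full stop keeps that punctuation. A simple remedy, sketched with an invented sentence, is to strip common trailing punctuation after extraction:

```python
import re

# Hypothetical text where each URL is directly followed by punctuation
text = "Docs live at https://www.example.com/docs, and the blog is http://blog.example.org."

raw_urls = re.findall(r"https?://\S+", text)
# ['https://www.example.com/docs,', 'http://blog.example.org.']

# Remove punctuation that is very unlikely to be the last character of a URL
urls = [u.rstrip(".,;:!?)") for u in raw_urls]
# ['https://www.example.com/docs', 'http://blog.example.org']
```

This is a heuristic: a URL that legitimately ends in one of these characters would be shortened.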

6. Practical Use Cases in Data Cleaning

Task	RegEx Used
Removing special characters	`[^\w\s]`
Removing extra spaces	`\s+`
Extracting numbers	`\d+`
Extracting emails	`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
Extracting phone numbers	`(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}`
Extracting URLs	`https?://[^\s]+`
Validating email addresses	`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`
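In practice several of these patterns are applied together to every value in a column. The sketch below combines the first two rows of the table into one helper. The function name `clean_text` and the sample rows are assumptions made for illustration.

```python
import re

# Compile once, reuse for every row
NON_WORD = re.compile(r"[^\w\s]")   # anything that is not a word character or whitespace
WHITESPACE = re.compile(r"\s+")     # runs of spaces, tabs, or newlines

def clean_text(value):
    # Drop punctuation first, then collapse whitespace and trim the ends
    without_punct = NON_WORD.sub("", value)
    return WHITESPACE.sub(" ", without_punct).strip()

# Hypothetical raw values
rows = ["  Hello!!   World  ", "data\t\tcleaning...", "no-change needed?"]
cleaned = [clean_text(r) for r in rows]
# ['Hello World', 'data cleaning', 'nochange needed']
```

The last row shows a side effect: removing the hyphen joins "no-change" into "nochange". If that matters, replace punctuation with a space instead of an empty string.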
