Python Regular Expressions for Data Cleaning


Python Regular Expressions for Data Cleaning – A Detailed Guide

Regular Expressions (RegEx) are a powerful tool for pattern matching and text manipulation in Python. In data cleaning, RegEx helps to extract, replace, or validate text patterns efficiently.


1. What are Regular Expressions (RegEx)?

A regular expression (RegEx) is a sequence of characters defining a search pattern. It is widely used in text processing, data cleaning, validation, and parsing.

Python provides the re module for handling regular expressions.


2. Importing the re Module

To use regular expressions in Python, import the re module:

import re

3. Basic Regular Expression Functions

The re module provides several functions for pattern matching:

Function        Description
re.search()     Scans the string and returns a match object for the first match, or None
re.match()      Checks whether the pattern matches at the start of the string
re.findall()    Returns all non-overlapping matches of the pattern as a list
re.sub()        Replaces occurrences of the pattern with a replacement
re.split()      Splits the string wherever the pattern matches
re.compile()    Compiles a regular expression into a pattern object for reuse
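
As a quick illustration of the functions above, here is a minimal sketch that runs each one against the same string. The sample text and variable names are made up for this example.

```python
import re

text = "Order 66 shipped on 2024-01-15; order 67 is pending."

# re.search() returns a match object for the first hit, or None if there is none
first = re.search(r'\d+', text)
first_number = first.group() if first else None        # '66'

# re.match() only succeeds when the pattern matches at the start of the string
starts_with_order = re.match(r'Order', text) is not None   # True
starts_with_digit = re.match(r'\d', text) is not None      # False

# re.findall() returns every non-overlapping match as a list of strings
all_numbers = re.findall(r'\d+', text)

# re.sub() replaces each match with the replacement string
masked = re.sub(r'\d', '#', text)

# re.split() cuts the string wherever the pattern matches
parts = re.split(r';\s*', text)

# re.compile() builds a reusable pattern object that offers the same methods
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
dates = date_pattern.findall(text)                      # ['2024-01-15']
```

Checking the result of re.search() or re.match() against None before calling .group() avoids an AttributeError when nothing matches.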

4. Understanding RegEx Patterns

Metacharacters in RegEx

Metacharacter   Description                                             Example
.               Any character except a newline                          h.t → Matches “hat”, “hit”
^               Start of string                                         ^hello → Matches strings that start with “hello”
$               End of string                                           world$ → Matches strings that end with “world”
*               Zero or more occurrences                                ab* → Matches “a”, “ab”, “abb”, “abbb”
+               One or more occurrences                                 ab+ → Matches “ab”, “abb”, “abbb”
?               Zero or one occurrence                                  ab? → Matches “a”, “ab”
{n}             Exactly n occurrences                                   a{3} → Matches “aaa”
{n,}            At least n occurrences                                  a{2,} → Matches “aa”, “aaa”, “aaaa”
{n,m}           Between n and m occurrences                             a{2,4} → Matches “aa”, “aaa”, “aaaa”
[]              Matches one of the specified characters                 [aeiou] → Matches “a”, “e”, “i”, “o”, “u”
\d              Matches digits (0-9)                                    \d+ → Matches “123”, “45”
\D              Matches non-digits                                      \D+ → Matches “abc”
\s              Matches whitespace (space, tab, newline)                \s+ → Matches spaces, tabs, newlines
\S              Matches non-whitespace characters                       \S+ → Matches “Hello”
\w              Matches word characters (letters, digits, underscore)   \w+ → Matches “word_123”
\W              Matches non-word characters                             \W+ → Matches punctuation, spaces

Note: for str patterns in Python 3, \d, \w, and \s also match Unicode digits, letters, and whitespace. Pass the re.ASCII flag to restrict them to ASCII.
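
To make a few of these rows concrete, the sketch below checks some of the table's examples with re.fullmatch() and re.findall(). The sample strings are invented for illustration.

```python
import re

samples = ["a", "ab", "abbb", "hat", "hot", "h\nt"]

# fullmatch() requires the whole string to fit the pattern
star = [s for s in samples if re.fullmatch(r'ab*', s)]   # zero or more "b"
plus = [s for s in samples if re.fullmatch(r'ab+', s)]   # one or more "b"
dot = [s for s in samples if re.fullmatch(r'h.t', s)]    # "." does not match "\n"

# Character classes and shorthand sequences with findall()
vowels = re.findall(r'[aeiou]', "regex")
runs = re.findall(r'a{2,4}', "a aa aaaaa")      # quantifiers are greedy: takes up to 4
words = re.findall(r'\w+', "word_123 & more!")
non_words = re.findall(r'\W+', "word_123 & more!")
```

Here star keeps “a”, “ab”, and “abbb”, while plus drops “a” because + needs at least one “b”.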

5. Using Regular Expressions for Data Cleaning

Now, let’s explore how to use regular expressions for data cleaning.


5.1 Removing Unwanted Characters

Example: Removing Special Characters

If your data contains unwanted characters like punctuation, you can remove them using re.sub().

import re

text = "Hello!! How are you? I'm fine..."
clean_text = re.sub(r'[^\w\s]', '', text)  # Remove everything except word characters and whitespace
print(clean_text)

Output:

Hello How are you Im fine
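
Note that the pattern above also strips the apostrophe, turning “I'm” into “Im”, and that \w keeps underscores. If that is not what you want, the character class can be adjusted. The following is a small sketch of two such variations; which characters to keep is an assumption that depends on your data.

```python
import re

text = "Hello!! How are you? I'm fine..."

# Add the apostrophe to the negated class so contractions survive
keep_apostrophes = re.sub(r"[^\w\s']", '', text)

# \w includes the underscore; add "|_" to remove underscores as well
no_underscores = re.sub(r'[^\w\s]|_', '', "user_name!")
```

The first call yields "Hello How are you I'm fine" and the second yields "username".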

5.2 Removing Extra Spaces

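Text often contains runs of extra whitespace. The pattern \s+ matches each run of spaces, tabs, or line breaks, which we replace with a single space; strip() then removes leading and trailing whitespace.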

text = "This   is    a   text   with   extra    spaces."
clean_text = re.sub(r'\s+', ' ', text).strip()
print(clean_text)

Output:

This is a text with extra spaces.
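
Because \s also matches line breaks, the approach above merges multi-line text into a single line. When line structure matters, one option is to collapse only spaces and tabs. This is a minimal sketch with an invented two-line sample:

```python
import re

text = "name:   Alice  \nrole:\t\tanalyst"

# \s+ also matches "\n", so the two lines are merged into one
flattened = re.sub(r'\s+', ' ', text).strip()

# [ \t]+ only matches spaces and tabs; each line is cleaned separately
lines = [re.sub(r'[ \t]+', ' ', line).strip() for line in text.splitlines()]
cleaned = "\n".join(lines)
```

Here flattened is "name: Alice role: analyst", while cleaned keeps the line break between the two fields.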

5.3 Extracting Numbers from Text

To extract whole numbers from text, use re.findall() with \d+, which returns each run of digits as a string.

text = "The price of the product is 499 dollars and the discount is 50 dollars."
numbers = re.findall(r'\d+', text)
print(numbers)

Output:

['499', '50']
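
The results are strings, and \d+ splits a value like 21.5 into two separate matches. For signed or decimal values, a slightly longer pattern can be used and the matches converted afterwards. The pattern below is one possible sketch, not a complete number parser; the sample sentence is invented.

```python
import re

text = "Temperature fell from 21.5 to -3 degrees, then rose by 0.75."

# Optional minus sign, digits, then an optional fractional part
number_pattern = r'-?\d+(?:\.\d+)?'
numbers = re.findall(number_pattern, text)

# Convert the matched strings to floats for numeric work
values = [float(n) for n in numbers]

# The integer case from the example above, converted to int
prices = [int(n) for n in re.findall(r'\d+', "price is 499 dollars, discount is 50")]
```

One caveat: this pattern reads the hyphen in a range such as “10-20” as a minus sign, so ranges may need separate handling.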

5.4 Extracting Emails

Emails in a dataset can be extracted using pattern matching.

text = "Contact us at support@example.com or sales@shop.com for inquiries."
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)

Output:

['support@example.com', 'sales@shop.com']
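
In a real dataset the same address often appears several times and in different letter cases. As a rough sketch, the pattern can be compiled once, applied to every record, and the results lowercased and de-duplicated. The records list here is hypothetical.

```python
import re

EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

# Hypothetical free-text records
records = [
    "Reach Ann at Ann.Lee@Example.com",
    "no contact given",
    "ann.lee@example.com, bob@shop.co.uk",
]

# Collect every address from every record, normalized to lowercase
found = [email.lower() for record in records for email in EMAIL.findall(record)]

# dict.fromkeys() removes duplicates while keeping first-seen order
unique_emails = list(dict.fromkeys(found))
```

This leaves unique_emails as ['ann.lee@example.com', 'bob@shop.co.uk'].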

5.5 Extracting Phone Numbers

We can extract phone numbers based on a pattern.

text = "Call us at +1-800-555-1234 or (123) 456-7890."
# The country code group is optional, so numbers such as "(123) 456-7890" also match
phones = re.findall(r'(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)
print(phones)

Output:

['+1-800-555-1234', '(123) 456-7890']
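
Once extracted, phone numbers usually need to be brought into one format. The sketch below strips every non-digit and reformats the last ten digits. It assumes ten-digit national numbers (North American style), which will not hold for every dataset.

```python
import re

raw_phones = ['+1-800-555-1234', '(123) 456-7890']

# \D matches any non-digit, so this keeps digits only
digits_only = [re.sub(r'\D', '', phone) for phone in raw_phones]

# Assumption: the last ten digits are the national number
formatted = [
    re.sub(r'(\d{3})(\d{3})(\d{4})$', r'\1-\2-\3', digits[-10:])
    for digits in digits_only
]
```

The groups \1, \2, and \3 in the replacement refer to the three parenthesized parts of the pattern, giving ['800-555-1234', '123-456-7890'].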

5.6 Validating Emails

We can validate emails using re.match().

def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

print(is_valid_email("user@example.com"))  # True
print(is_valid_email("user@@example.com"))  # False
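
One subtlety: in Python, $ also matches just before a newline at the very end of the string, so the function above accepts "user@example.com\n". If that matters for your data, re.fullmatch() is a stricter alternative. The function name is_valid_email_strict is made up for this sketch.

```python
import re

EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def is_valid_email_strict(email):
    # fullmatch() succeeds only if the pattern consumes the entire string
    return EMAIL_PATTERN.fullmatch(email) is not None

anchored = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
with_newline = "user@example.com\n"

# "$" matches just before the final newline, so re.match() accepts this string
loose_result = bool(re.match(anchored, with_newline))   # True
strict_result = is_valid_email_strict(with_newline)     # False
```

Either way, a pattern like this only checks the general shape of an address; it does not confirm that the address exists.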

5.7 Replacing Text (Data Standardization)

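We can rewrite abbreviations in one consistent form. The \b word boundaries ensure that only the standalone tokens AM and PM are replaced, not those letters inside longer words.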

text = "The flight departs at 10 AM & arrives at 3 PM."
clean_text = re.sub(r'\bAM\b', 'A.M.', text)
clean_text = re.sub(r'\bPM\b', 'P.M.', clean_text)
print(clean_text)

Output:

The flight departs at 10 A.M. & arrives at 3 P.M.
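
With more than a couple of abbreviations, chaining re.sub() calls gets repetitive. One alternative is a lookup table combined with a replacement function. This is a sketch; the STANDARD_FORMS mapping and the standardize name are hypothetical.

```python
import re

# Hypothetical mapping from each abbreviation to its standard form
STANDARD_FORMS = {"AM": "A.M.", "PM": "P.M."}

# One alternation built from the keys; re.escape() guards against special characters
pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, STANDARD_FORMS)) + r')\b',
    re.IGNORECASE,
)

def standardize(text):
    # The replacement is a function that looks up each match in the mapping
    return pattern.sub(lambda m: STANDARD_FORMS[m.group(1).upper()], text)

result = standardize("Open 9 am to 5 PM; the amplifier ships Monday.")
```

The result is "Open 9 A.M. to 5 P.M.; the amplifier ships Monday." The word “amplifier” is untouched because \b requires a word boundary after “am”.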

5.8 Extracting URLs from Text

If a dataset contains URLs, we can extract them.

text = "Visit our website at https://www.example.com or follow http://blog.example.org"
urls = re.findall(r'https?://[^\s]+', text)
print(urls)

Output:

['https://www.example.com', 'http://blog.example.org']
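
Because [^\s]+ matches everything up to the next whitespace, a URL followed directly by a comma or period keeps that punctuation. A simple workaround, sketched below, is to strip common trailing punctuation after matching. The set of characters to strip is an assumption and may need tuning.

```python
import re

text = "See https://www.example.com/docs, then http://blog.example.org."

raw_urls = re.findall(r'https?://[^\s]+', text)

# Remove sentence punctuation that was captured at the end of each URL
urls = [url.rstrip('.,;:!?)') for url in raw_urls]
```

Here raw_urls still contains the trailing “,” and “.”, while urls holds the cleaned addresses.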

6. Practical Use Cases in Data Cleaning

Task                          RegEx Used
Removing special characters   [^\w\s]
Removing extra spaces         \s+
Extracting numbers            \d+
Extracting emails             [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Extracting phone numbers      (?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}
Extracting URLs               https?://[^\s]+
Validating email addresses    ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
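
When several of these patterns are applied to many rows, it can help to compile them once and run them together. The following is a minimal sketch; the PATTERNS dictionary, the extract_fields function, and the sample row are all invented for illustration.

```python
import re

# Patterns from the table above, compiled once for reuse
PATTERNS = {
    "numbers": re.compile(r'\d+'),
    "emails": re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
    "urls": re.compile(r'https?://[^\s]+'),
}

def extract_fields(text):
    # Run every compiled pattern over the text and collect the matches by name
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

row = "Invoice 1042: mail billing@example.com or see https://example.com/pay"
fields = extract_fields(row)
```

For this row, fields maps "numbers" to ['1042'], "emails" to ['billing@example.com'], and "urls" to ['https://example.com/pay'].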
