Python Regular Expressions Cheatsheet¶

Basic Patterns¶

Literal Characters¶

Regular expressions match literal characters by default.

import re
text = "Python is powerful"
result = re.search("Python", text)  # Matches "Python"

Metacharacters¶

Special characters with meaning in regex must be escaped with a backslash \ to match literally.

# Metacharacters: . ^ $ * + ? { } [ ] \ | ( )
text = "Cost: $25.99"
result = re.search("\$\d+\.\d+", text)  # Matches "$25.99"

Character Classes¶

Single Character Matchers¶

. (dot) matches any character except a newline.

pattern = "c.t"
# Matches "cat", "cut", "c@t", etc.

Character Sets¶

[ ] define a character set - matches any single character within the brackets.

pattern = "[aeiou]"  # Matches any single vowel
re.findall(pattern, "apple")  # Returns ['a', 'e']

# Range of characters
pattern = "[a-z]"  # Matches any lowercase letter
pattern = "[A-Za-z]"  # Matches any letter
pattern = "[0-9]"  # Matches any digit

Negated Character Sets¶

[^ ] matches any character NOT in the set.

pattern = "[^0-9]"  # Matches any non-digit
re.findall(pattern, "abc123")  # Returns ['a', 'b', 'c']

Predefined Character Classes¶

Shorthand notation for common character sets:

\d  # Matches any digit [0-9]
\D  # Matches any non-digit [^0-9]
\w  # Matches any word character [a-zA-Z0-9_]
\W  # Matches any non-word character
\s  # Matches any whitespace character (space, tab, newline)
\S  # Matches any non-whitespace character

# Example
pattern = r"\d\s\w+"  # Digit, followed by whitespace, followed by 1+ word chars
re.search(pattern, "7 apples")  # Matches "7 apples"

Anchors and Boundaries¶

Anchors match positions rather than characters:

^  # Matches the start of a string
$  # Matches the end of a string
\b  # Matches a word boundary
\B  # Matches a non-word boundary

# Examples
pattern = r"^Python"  # Matches "Python" only at the start
re.search(pattern, "Python is great")  # Match
re.search(pattern, "I love Python")    # No match

pattern = r"Python$"  # Matches "Python" only at the end
re.search(pattern, "I love Python")    # Match
re.search(pattern, "Python is great")  # No match

pattern = r"\bcat\b"  # Matches the word "cat" with boundaries
re.search(pattern, "The cat sits")     # Match
re.search(pattern, "category")         # No match

The Dual Meaning of ^¶

It's important to note that the caret symbol ^ has two distinct meanings in regular expressions:

Outside square brackets: When used at the beginning of a pattern, it's an anchor that matches the start of a string or line.
```
pattern = r"^abc"  # Matches "abc" only at the start of the string
```
Inside square brackets: When used as the first character inside square brackets, it negates the character class, meaning "match any character EXCEPT these."
```
pattern = r"[^0-9]"  # Matches any character that is NOT a digit
```

This distinction is crucial when reading and writing regex patterns. For example, in an email validation pattern:

email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

The first ^ indicates the match must start at the beginning of the string, while in [a-zA-Z0-9._%+-] the characters are simply a positive character class defining what's allowed in the username portion of the email.

Quantifiers¶

Quantifiers specify how many times a pattern should match:

*      # 0 or more repetitions
+      # 1 or more repetitions
?      # 0 or 1 repetition
{n}    # Exactly n repetitions
{n,}   # n or more repetitions
{n,m}  # Between n and m repetitions

# Examples
pattern = r"ab*c"     # Matches "ac", "abc", "abbc", etc.
pattern = r"ab+c"     # Matches "abc", "abbc", etc. (not "ac")
pattern = r"colou?r"  # Matches "color" or "colour"
pattern = r"\d{2,4}"  # Matches 2 to 4 digits

Greedy vs. Non-Greedy¶

Quantifiers are greedy by default (match as much as possible). Adding ? after a quantifier makes it non-greedy.

text = "<div>Content</div><div>More</div>"

# Greedy matching
pattern = r"<div>.*</div>"
re.findall(pattern, text)  # Returns ['<div>Content</div><div>More</div>']

# Non-greedy matching
pattern = r"<div>.*?</div>"
re.findall(pattern, text)  # Returns ['<div>Content</div>', '<div>More</div>']

Grouping and Capturing¶

Parentheses ( ) create capture groups:

pattern = r"(\d{3})-(\d{3})-(\d{4})"
match = re.search(pattern, "Phone: 123-456-7890")
match.group(0)  # Entire match: "123-456-7890"
match.group(1)  # First group: "123"
match.group(2)  # Second group: "456"
match.group(3)  # Third group: "7890"
match.groups()  # All groups: ("123", "456", "7890")

Named Groups¶

Use (?P<name>...) for named groups:

pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
match = re.search(pattern, "Phone: 123-456-7890")
match.group("area")    # "123"
match.group("prefix")  # "456"
match.group("line")    # "7890"

Non-Capturing Groups¶

Use (?:...) for non-capturing groups:

pattern = r"(?:\d{3})-\d{3}-(\d{4})"
match = re.search(pattern, "123-456-7890")
match.groups()  # Only contains ("7890"), the first part isn't captured

Alternation¶

The pipe symbol | works as an OR operator:

pattern = r"cat|dog"
re.findall(pattern, "I have a cat and a dog")  # Returns ['cat', 'dog']

# Grouped alternation
pattern = r"(cat|dog)s?"
re.findall(pattern, "I have cats and a dog")  # Returns ['cat', 'dog']

Lookahead and Lookbehind¶

These are zero-width assertions that don't consume characters:

# Positive lookahead (?=...): Match if followed by pattern
pattern = r"\w+(?=\s+is)"
re.findall(pattern, "Python is great, Java is powerful")  # ['Python', 'Java']

# Negative lookahead (?!...): Match if NOT followed by pattern
pattern = r"Python(?!\s+3)"
re.search(pattern, "Python 2.7")  # Match
re.search(pattern, "Python 3.9")  # No match

# Positive lookbehind (?<=...): Match if preceded by pattern
pattern = r"(?<=\$)\d+"
re.findall(pattern, "Items: $10, $25, €30")  # ['10', '25']

# Negative lookbehind (?<!...): Match if NOT preceded by pattern
pattern = r"(?<!\$)\d+"
re.findall(pattern, "$10, 20, $30")  # ['0', '20', '0']

Common Functions¶

re.search()¶

Finds the first match of the pattern:

result = re.search(r"\d+", "abc123def456")
if result:
    print(result.group())  # "123"

re.match()¶

Matches pattern only at the beginning of the string:

re.match(r"\d+", "123abc")   # Match
re.match(r"\d+", "abc123")   # No match

re.findall()¶

Returns all non-overlapping matches as a list:

re.findall(r"\d+", "abc123def456")  # Returns ['123', '456']

re.finditer()¶

Returns an iterator of match objects:

for match in re.finditer(r"\d+", "abc123def456"):
    print(match.group(), match.span())  # "123" (3, 6), "456" (9, 12)

re.sub()¶

Substitutes matches with a replacement:

# Basic substitution
re.sub(r"\d+", "NUM", "abc123def456")  # Returns "abcNUMdefNUM"

# Using backreferences
re.sub(r"(\d{3})-(\d{3})-(\d{4})", r"(\1) \2-\3", "123-456-7890")
# Returns "(123) 456-7890"

# Using a function for replacement
def double_digits(match):
    return str(int(match.group()) * 2)

re.sub(r"\d+", double_digits, "abc123def456")  # Returns "abc246def912"

re.split()¶

Splits a string by pattern matches:

re.split(r"\s+", "Split   these words")  # Returns ['Split', 'these', 'words']
re.split(r"[,;]", "apple,orange;banana")  # Returns ['apple', 'orange', 'banana']

Flags¶

Modify regex behavior with flags:

re.IGNORECASE  # or re.I: Case-insensitive matching
re.search(r"python", "Python", re.IGNORECASE)  # Match

re.MULTILINE  # or re.M: ^ and $ match start/end of each line
text = "Line 1\nLine 2"
re.findall(r"^Line", text, re.MULTILINE)  # Returns ['Line', 'Line']

re.DOTALL  # or re.S: Dot matches any character including newline
re.search(r"Line 1.+Line 2", "Line 1\nLine 2", re.DOTALL)  # Match

re.VERBOSE  # or re.X: Allow whitespace and comments in pattern
pattern = re.compile(r"""
    \d{3}  # Area code
    [-.]?  # Optional separator
    \d{3}  # Prefix
    [-.]?  # Optional separator
    \d{4}  # Line number
    """, re.VERBOSE)

# Multiple flags
re.search(pattern, text, re.IGNORECASE | re.MULTILINE)

Raw Strings¶

Use raw strings (r"...") to avoid issues with backslashes:

# Without raw string
re.search("\\d+", "123")  # Need double backslash

# With raw string (recommended)
re.search(r"\d+", "123")  # Much cleaner

Practical Examples¶

Email Validation¶

email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
re.match(email_pattern, "user@example.com")  # Match
re.match(email_pattern, "invalid@email")     # No match

URL Extraction¶

url_pattern = r"https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_+.~#?&/=]*)"
re.findall(url_pattern, "Visit https://example.com and http://test.org")
# Returns ['https://example.com', 'http://test.org']

Date Formatting¶

# Convert MM/DD/YYYY to YYYY-MM-DD
date_text = "Today's date: 12/25/2023"
re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", date_text)
# Returns "Today's date: 2023-12-25"

Password Validation¶

# At least 8 chars with 1+ uppercase, 1+ lowercase, 1+ digit, 1+ special char
password_pattern = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
re.match(password_pattern, "Passw0rd!")  # Match
re.match(password_pattern, "password")   # No match

Stripping HTML Tags¶

html_text = "<p>This is <b>bold</b> text.</p>"
re.sub(r"<[^>]*>", "", html_text)  # Returns "This is bold text."

Extracting Quoted Text¶

text = 'She said "hello" and he replied "goodbye"'
re.findall(r'"([^"]*)"', text)  # Returns ['hello', 'goodbye']

Performance Tips¶

Compile patterns for repeated use:

pattern = re.compile(r"\d+")
pattern.findall("123 456")  # More efficient for multiple operations

Avoid unnecessary backtracking:

# Instead of r"a.*z"
# Use r"a[^z]*z" if possible

Use non-capturing groups when you don't need the captured text:

# Instead of r"(pattern)"
# Use r"(?:pattern)" when you don't need to reference the group

Be specific rather than using broad patterns:

# Instead of r".*"
# Use r"\d+" if you're specifically looking for digits

Use appropriate anchors to limit search space:

# Instead of r"pattern"
# Use r"^pattern$" if you want to match the entire string