Regular expressions (RegEx) are a powerful tool for string manipulation, pattern matching, and text processing. In Python, the re module provides support for working with regular expressions. Whether you're validating email addresses, extracting phone numbers, or parsing complex data, mastering RegEx can make your code more efficient and concise.
In this comprehensive guide, we will walk you through the essentials of Python regular expressions, explain how they work, and provide practical examples to help you harness the full potential of RegEx in Python.
re Module
A regular expression (RegEx) is a sequence of characters that defines a search pattern. RegEx is commonly used for string searching and pattern matching in text, such as validating formats (like email addresses), searching for specific patterns, and replacing parts of strings.
In Python, the re module provides a set of functions for working with RegEx.
A simple regular expression to match a digit is:
\d
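This matches any single digit (0-9). With str patterns in Python 3, \d also matches other Unicode decimal digits unless the re.ASCII flag is set; use [0-9] to match only the ASCII digits.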
re Module
Python's re module provides several functions for working with regular expressions. To get started, you need to import the module:
import re
The core functionality of re revolves around searching for patterns, matching strings, and manipulating text. Below are some of the commonly used functions:
re Module
re.match(): Checks if the regular expression matches at the start of the string.
re.search(): Searches for the first occurrence of the regular expression anywhere in the string.
re.findall(): Finds all non-overlapping matches of the regular expression in the string and returns them as a list.
re.sub(): Replaces parts of the string that match the regular expression with a new string.
re.split(): Splits the string at all matches of the regular expression.
Let’s break down the basic components of regular expressions and how they are used in Python.
A literal character in a regular expression is a character that matches itself. For example, the regular expression "abc" will match the string "abc".
Some characters have special meanings in regular expressions:
. (Dot): Matches any character except a newline.
\d: Matches any digit (0-9).
\w: Matches any word character (letters, digits, and the underscore).
\s: Matches any whitespace character (spaces, tabs, newlines).
^: Anchors the match at the beginning of the string.
$: Anchors the match at the end of the string.
[]: Defines a character class, matching any single character within the brackets.
|: Acts as a logical OR, matching either the pattern on the left or the pattern on the right.
*: Matches 0 or more repetitions of the preceding character or group.
+: Matches 1 or more repetitions of the preceding character or group.
?: Makes the preceding character or group optional (matches 0 or 1 time).
import re
pattern = r'\d'
text = "There are 3 apples"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())
Output:
Match found: 3
In this example, the regular expression \d matches the digit 3 in the string.
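The other special characters listed earlier can be tried the same way. The following is a minimal sketch; the sample strings and variable names are made up for illustration and do not come from a real dataset.

```python
import re

# '.' matches any character except a newline, so "c\nt" is skipped
dot_matches = re.findall(r'c.t', "cat cot c\nt cut")
print(dot_matches)    # ['cat', 'cot', 'cut']

# '^' and '$' anchor the pattern to the start and end of the string
anchored = bool(re.search(r'^Hello.*world$', "Hello, world"))
print(anchored)       # True

# '[]' defines a character class; '|' is alternation
vowels = re.findall(r'[aeiou]', "regex")
pets = re.findall(r'cat|dog', "cat, dog, bird")
print(vowels, pets)   # ['e', 'e'] ['cat', 'dog']

# '*', '+', and '?' control how often the preceding item may repeat
star = re.findall(r'ab*', "a ab abbb")               # zero or more 'b'
plus = re.findall(r'ab+', "a ab abbb")               # one or more 'b'
optional = re.findall(r'colou?r', "color colour")    # optional 'u'
print(star)           # ['a', 'ab', 'abbb']
print(plus)           # ['ab', 'abbb']
print(optional)       # ['color', 'colour']
```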
re.match()
re.match() only checks if the regular expression matches the beginning of the string. If there is a match, it returns a match object; otherwise, it returns None.
import re
pattern = r'^abc'
text = "abcdef"
match = re.match(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match")
Output:
Match found: abc
re.search()
re.search() searches for the first match of the regular expression anywhere in the string.
import re
pattern = r'abc'
text = "123 abc 456"
search_result = re.search(pattern, text)
if search_result:
    print("Match found:", search_result.group())
else:
    print("No match")
Output:
Match found: abc
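To make the difference between the two functions concrete, here is a small sketch that runs both on the same string. It also uses re.fullmatch(), which this guide does not otherwise cover, to show a third option that requires the pattern to cover the entire string.

```python
import re

text = "123 abc 456"

# re.match() only tries the pattern at position 0
at_start = re.match(r'abc', text)

# re.search() scans forward until the pattern matches somewhere
anywhere = re.search(r'abc', text)

# re.fullmatch() requires the pattern to cover the whole string
whole = re.fullmatch(r'\d+ abc \d+', text)

print(at_start)            # None, because the string starts with "123"
print(anywhere.span())     # (4, 7)
print(whole is not None)   # True
```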
re.findall()
re.findall() returns a list of all non-overlapping matches in the string.
import re
pattern = r'\d+'
text = "I have 2 apples and 3 oranges"
numbers = re.findall(pattern, text)
print(numbers)
Output:
['2', '3']
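Two related behaviors are worth knowing. When the pattern contains capturing groups, re.findall() returns the groups rather than the whole match, and re.finditer() yields match objects when positions are needed. The sketch below reuses the sentence from the example above; the variable names are illustrative.

```python
import re

text = "I have 2 apples and 3 oranges"

# With two capturing groups, findall() returns a list of tuples
pairs = re.findall(r'(\d+) (\w+)', text)
print(pairs)       # [('2', 'apples'), ('3', 'oranges')]

# finditer() yields match objects, which also expose the match position
positions = [(m.group(), m.start()) for m in re.finditer(r'\d+', text)]
print(positions)   # [('2', 7), ('3', 20)]
```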
re.sub()
re.sub() replaces occurrences of the pattern with a new string.
import re
pattern = r'apples'
text = "I have 2 apples and 3 apples"
new_text = re.sub(pattern, "oranges", text)
print(new_text)
Output:
I have 2 oranges and 3 oranges
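re.sub() also accepts a count argument, backreferences in the replacement string, and a function as the replacement. The following sketch shows each option on the same sentence; it is meant as an illustration rather than a complete reference.

```python
import re

text = "I have 2 apples and 3 apples"

# count=1 limits the replacement to the first match
first_only = re.sub(r'apples', "oranges", text, count=1)
print(first_only)   # I have 2 oranges and 3 apples

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r'(\d+) (\w+)', r'\2: \1', text)
print(swapped)      # I have apples: 2 and apples: 3

# A function receives each match object and returns the replacement text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)
print(doubled)      # I have 4 apples and 6 apples
```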
re.split()
re.split() splits the string at all matches of the regular expression.
import re
pattern = r'\s+'
text = "This is a test"
words = re.split(pattern, text)
print(words)
Output:
['This', 'is', 'a', 'test']
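A few variations on re.split() come up often: limiting the number of splits, splitting on several delimiters at once, and keeping the delimiters in the result. The sketch below uses made-up input strings to show each one.

```python
import re

# maxsplit limits how many splits are performed
head, rest = re.split(r'\s+', "This is a test", maxsplit=1)
print(head, '|', rest)   # This | is a test

# A character class handles several delimiters at once
fields = re.split(r'[,;]\s*', "a, b;c,  d")
print(fields)            # ['a', 'b', 'c', 'd']

# A capturing group keeps the delimiters in the result
tokens = re.split(r'([+-])', "1+2-3")
print(tokens)            # ['1', '+', '2', '-', '3']
```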
You can group parts of the regular expression using parentheses (). This allows you to capture specific parts of the match.
import re
pattern = r'(\d+)-(\d+)-(\d+)'
text = "My phone number is 123-456-7890"
match = re.search(pattern, text)
if match:
    print("Area code:", match.group(1))
    print("Prefix:", match.group(2))
    print("Line number:", match.group(3))
Output:
Area code: 123
Prefix: 456
Line number: 7890
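Groups can also be given names with the (?P<name>...) syntax, which tends to be easier to read than numeric indexes. Here is a sketch of the same phone number example with named groups; the group names area, prefix, and line are chosen for illustration.

```python
import re

# Each group is named so it can be looked up by label instead of by index
pattern = r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})'
text = "My phone number is 123-456-7890"

match = re.search(pattern, text)
if match:
    print(match.group('area'))   # 123
    print(match.groupdict())     # {'area': '123', 'prefix': '456', 'line': '7890'}
    print(match.groups())        # ('123', '456', '7890')
```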
If you want to group parts of a regular expression without capturing them for later use, you can use a non-capturing group with (?:...).
import re
pattern = r'(?:\d{3}-){2}\d{4}'
text = "My phone number is 123-456-7890"
match = re.search(pattern, text)
if match:
    print("Phone number matched:", match.group())
Output:
Phone number matched: 123-456-7890
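The choice between capturing and non-capturing groups matters most with re.findall(), because a capturing group changes what the function returns. The sketch below compares the two forms on an illustrative string with two phone numbers.

```python
import re

text = "Call 123-456-7890 or 987-654-3210"

# Capturing group: findall() returns only the group's last repetition
captured = re.findall(r'(\d{3}-){2}\d{4}', text)
print(captured)   # ['456-', '654-']

# Non-capturing group: findall() returns the whole match
whole = re.findall(r'(?:\d{3}-){2}\d{4}', text)
print(whole)      # ['123-456-7890', '987-654-3210']
```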
Here are some practical examples of how RegEx can be used in Python:
import re
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
email = "test@example.com"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
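One subtlety: $ also matches just before a trailing newline, so the check above accepts "test@example.com\n". A possible variant, sketched here with a hypothetical helper named is_valid_email, uses re.fullmatch() to require that the whole string fits. The pattern is still a simplification and does not cover every address that the email standards allow.

```python
import re

# Same character sets as the pattern above, without the anchors
# (fullmatch() makes them unnecessary). The hyphen is moved to the end
# of the last class, which does not change what it matches.
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+')

def is_valid_email(candidate):
    """Return True if the whole string fits the simplified email pattern."""
    return EMAIL_RE.fullmatch(candidate) is not None

results = {}
for sample in ["test@example.com", "no-at-sign.com", "test@example.com\n"]:
    results[sample] = is_valid_email(sample)
    print(repr(sample), results[sample])
```

Only the first sample is accepted; the one with the trailing newline is rejected because fullmatch() must consume every character.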
import re
pattern = r'\d{2}/\d{2}/\d{4}'
text = "The event is scheduled for 12/25/2024."
dates = re.findall(pattern, text)
print("Extracted dates:", dates)
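The pattern above checks only the shape of a date, so a string such as 13/45/2024 would also be extracted. One way to filter out impossible dates, sketched below, is to pass each candidate to datetime.strptime(). The MM/DD/YYYY order is an assumption based on the example text, and the second date in the sample sentence is added for illustration.

```python
import re
from datetime import datetime

text = "The event is scheduled for 12/25/2024, not 13/45/2024."

valid_dates = []
for candidate in re.findall(r'\b\d{2}/\d{2}/\d{4}\b', text):
    try:
        # Assumes MM/DD/YYYY order, as in the example text
        valid_dates.append(datetime.strptime(candidate, "%m/%d/%Y").date())
    except ValueError:
        pass  # right shape, but not a real calendar date

print(valid_dates)   # [datetime.date(2024, 12, 25)]
```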
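Compile Regular Expressions: When the same pattern is used many times, you can compile it once with re.compile() and reuse the resulting pattern object. This avoids looking up or recompiling the pattern on each call. Because the re module also caches recently used patterns, the speed gain is usually modest, but a named pattern object often makes the code easier to read.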
import re
pattern = re.compile(r'\d+')
result = pattern.findall("123 456 789")  # ['123', '456', '789']
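A compiled pattern can also carry flags and be reused across many inputs. The following is a small sketch with an illustrative list of lines; word_re is a name chosen here, not part of the re module.

```python
import re

# Compile once, with a flag, and reuse the pattern object
word_re = re.compile(r'\bpython\b', re.IGNORECASE)

lines = ["Python is fun", "I like pythons", "learn PYTHON today"]
hits = [line for line in lines if word_re.search(line)]
print(hits)       # ['Python is fun', 'learn PYTHON today']

# Pattern objects expose the same methods as the module-level functions
replaced = word_re.sub("Ruby", "Python and python")
print(replaced)   # Ruby and Ruby
```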
Avoid Greedy Matching: Be mindful of greedy quantifiers like * and +, which try to match as much text as possible. In some cases, this can lead to inefficient matches or unexpected results. Use non-greedy quantifiers (e.g., *?, +?) when appropriate.
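The difference is easiest to see on text with repeated delimiters. In this sketch, the HTML-like string is made up for illustration; regular expressions are generally not a robust way to parse real HTML.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: '.*' runs to the last '>' in the string
greedy = re.findall(r'<.*>', html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: '.*?' stops at the first '>' it can
lazy = re.findall(r'<.*?>', html)
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```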
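Use Raw Strings for RegEx: Write patterns as raw string literals (r"pattern") so that Python does not interpret backslashes before the regex engine sees them. For example, "\b" in a normal string is a backspace character, whereas r"\b" is a word boundary.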
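A short sketch of what goes wrong without a raw string, using the word-boundary escape \b on an illustrative sentence:

```python
import re

text = "a cat sat"

# In a normal string, "\b" is the backspace character (\x08)
plain = re.findall("\bcat\b", text)
print(plain)   # []

# In a raw string, r"\b" reaches the regex engine as a word boundary
raw = re.findall(r"\bcat\b", text)
print(raw)     # ['cat']

# The two literals are different strings
print(len("\b"), len(r"\b"))   # 1 2
```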