
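The re module is part of the Python standard library. It provides regular expression operations for matching, searching, splitting, and replacing text.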

Regular expression basics #

1. Common metacharacters #

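Character	Description
`.`	Matches any character except a newline (any character at all with re.DOTALL)
`^`	Matches the start of the string (and of each line with re.MULTILINE)
`$`	Matches the end of the string, or just before a trailing newline
`*`	Matches the preceding element 0 or more times
`+`	Matches the preceding element 1 or more times
`?`	Matches the preceding element 0 or 1 times
`{m}`	Matches the preceding element exactly m times
`{m,n}`	Matches the preceding element from m to n times
`[...]`	Matches any one character in the set
`[^...]`	Matches any one character not in the set
`|`	Alternation (OR)
`\d`	Matches a digit; equivalent to [0-9] with re.ASCII, otherwise any Unicode decimal digit
`\D`	Matches a non-digit
`\s`	Matches a whitespace character
`\S`	Matches a non-whitespace character
`\w`	Matches a word character; equivalent to [A-Za-z0-9_] with re.ASCII, otherwise also Unicode letters and digits
`\W`	Matches a non-word character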
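As a rough illustration of the table above, the following sketch runs a few of the metacharacters against small sample strings. The sample strings and variable names are made up for this example.

```python
import re

sample = 'gd god good'

# Quantifiers apply to the preceding element, here the single character "o".
zero_or_more = re.findall(r'go*d', sample)    # "o" may be absent
one_or_more = re.findall(r'go+d', sample)     # at least one "o"
optional = re.findall(r'go?d', sample)        # at most one "o"
exactly_two = re.findall(r'go{2}d', sample)   # exactly two "o"

# Character sets and their negation.
vowels = re.findall(r'[aeiou]', 'regex')
non_vowels = re.findall(r'[^aeiou]', 'regex')

# Alternation picks whichever branch matches at the current position.
pets = re.findall(r'cat|dog', 'cat dog bird')

print(zero_or_more, one_or_more, optional, exactly_two)
print(vowels, non_vowels, pets)
```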

2. Groups and capturing #

Expression	Description
`(...)`	Capturing group
`(?:...)`	Non-capturing group
`(?P<name>...)`	Named group
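The three group types can be compared in one pattern. This is a minimal sketch; the date string and the choice of which part to capture are arbitrary.

```python
import re

# Group 1 is a plain capturing group, the month is wrapped in a
# non-capturing group, and the day is a named group.
m = re.fullmatch(r'(\d{4})-(?:\d{2})-(?P<day>\d{2})', '2023-05-15')

whole = m.group(0)          # the entire match
captured = m.groups()       # only capturing groups, in order
year = m.group(1)
day_by_name = m.group('day')
day_by_index = m.group(2)   # named groups are numbered as well

print(whole, captured, year, day_by_name, day_by_index)
```

The non-capturing group still takes part in matching, but it does not appear in `groups()` and does not consume a group number.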

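Main functions of the re module #

1. re.match() #

Matches the pattern only at the start of the string and returns None if it does not match there:

import re

result = re.match(r'hello', 'hello world')
print(result.group())  # Output: hello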
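Because re.match() returns None when the pattern does not match at the start, calling .group() on the result directly can raise AttributeError. A small sketch of a guarded version follows; the helper name `leading_hello` is hypothetical.

```python
import re

def leading_hello(text):
    """Return the matched text, or None if text does not start with 'hello'."""
    m = re.match(r'hello', text)
    return m.group() if m is not None else None

print(leading_hello('hello world'))  # matched at position 0
print(leading_hello('say hello'))    # no match at the start

# For comparison: search() scans the whole string,
# fullmatch() requires the entire string to match.
found_anywhere = re.search(r'hello', 'say hello') is not None
whole_string = re.fullmatch(r'hello', 'hello world') is not None
```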

2. re.search() #

Scans the string and returns the first match found anywhere in it:

result = re.search(r'world', 'hello world')
print(result.group())  # Output: world

3. re.findall() #

Finds all non-overlapping matches and returns them as a list:

results = re.findall(r'\d+', '12 apples, 34 oranges')
print(results)  # Output: ['12', '34']

4. re.finditer() #

Finds all non-overlapping matches and returns an iterator of match objects:

for match in re.finditer(r'\d+', '12 apples, 34 oranges'):
    print(match.group())
# Output:
# 12
# 34
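Since finditer() yields match objects rather than strings, each result also carries its position. A short sketch using the same sample string:

```python
import re

text = '12 apples, 34 oranges'

# Pair each matched number with its (start, end) offsets in the text.
positions = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

for value, (start, end) in positions:
    print(f'{value!r} found at {start}..{end}')
```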

5. re.sub() #

Replaces every match with the given replacement:

text = re.sub(r'\d+', 'NUM', '12 apples, 34 oranges')
print(text)  # Output: NUM apples, NUM oranges
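The replacement passed to re.sub() can also refer back to captured groups, or be a function that receives each match object. The sketch below shows both forms plus the `count` argument; the variable names are illustrative.

```python
import re

text = '12 apples, 34 oranges'

# Backreferences \1 and \2 reuse the captured groups in the replacement.
swapped = re.sub(r'(\d+) (\w+)', r'\2: \1', text)

# A callable replacement computes the new text from each match.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)

# count limits how many matches are replaced.
first_only = re.sub(r'\d+', 'NUM', text, count=1)

print(swapped)
print(doubled)
print(first_only)
```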

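6. re.compile() #

Compiles a pattern into a reusable pattern object. The re module caches recently used patterns, so the main benefits are keeping the pattern in one place and skipping the cache lookup when it is used many times:

pattern = re.compile(r'\d+')
result = pattern.findall('12 apples, 34 oranges')
print(result)  # Output: ['12', '34']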
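A compiled pattern is typically defined once and then applied to many inputs. The following sketch assumes a small list of made-up lines; flags such as re.IGNORECASE are passed at compile time.

```python
import re

# Compiled once, reused for every line below.
NUMBER = re.compile(r'\d+')
APPLE = re.compile(r'apple', re.IGNORECASE)

lines = ['12 apples', 'no digits here', '34 oranges and 5 pears']

number_counts = [len(NUMBER.findall(line)) for line in lines]
mentions_apple = [bool(APPLE.search(line.upper())) for line in lines]

print(number_counts)
print(mentions_apple)
```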

Advanced usage #

1. Extracting groups #

text = "John: 30, Jane: 25"
pattern = r'(\w+): (\d+)'

# With more than one group, findall returns a list of tuples
for name, age in re.findall(pattern, text):
    print(f"{name} is {age} years old")
# Output:
# John is 30 years old
# Jane is 25 years old

2. Named groups #

text = "Date: 2023-05-15"
pattern = r'Date: (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

match = re.search(pattern, text)
print(match.groupdict())
# Output: {'year': '2023', 'month': '05', 'day': '15'}

3. Non-greedy matching #

# Greedy: .* takes as much as possible
print(re.search(r'<.*>', '<a> <b>').group())  # Output: <a> <b>

# Non-greedy: .*? takes as little as possible
print(re.search(r'<.*?>', '<a> <b>').group())  # Output: <a>

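4. Lookaround #

# Positive lookahead
print(re.findall(r'\w+(?=:)', 'John: 30, Jane: 25'))  # Output: ['John', 'Jane']

# Negative lookahead (\b keeps the engine from backtracking to a partial number such as '3')
print(re.findall(r'\b\d+\b(?! years)', '30 years, 25 months'))  # Output: ['25']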
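Lookbehind assertions work the same way in the other direction: `(?<=...)` requires the text before the current position to match, and `(?<!...)` requires that it does not. A minimal sketch with an invented sample string:

```python
import re

text = 'cost: $30, weight: 25kg'

# Positive lookbehind: digits that directly follow a dollar sign.
prices = re.findall(r'(?<=\$)\d+', text)

# Negative lookbehind: digits preceded by neither "$" nor another digit.
# Excluding a preceding digit stops the match from starting mid-number.
not_prices = re.findall(r'(?<![\$\d])\d+', text)

print(prices)
print(not_prices)
```

In Python's re module a lookbehind pattern must match a fixed-length string, which is why both examples look back exactly one character.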

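Practical examples #

1. Validating an email address #

def is_valid_email(email):
    # A deliberately simple check; it does not cover every valid address.
    # fullmatch requires the whole string to match, so ^ and $ are not needed
    # and a trailing newline is rejected.
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    return bool(re.fullmatch(pattern, email))

print(is_valid_email('test@example.com'))  # True
print(is_valid_email('invalid.email'))    # False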

2. Extracting URLs #

text = "Visit https://www.example.com or http://test.org"
urls = re.findall(r'https?://\S+', text)
print(urls)  # Output: ['https://www.example.com', 'http://test.org']

3. Stripping HTML tags #

# Compiled once at module level rather than on every call
TAG_RE = re.compile(r'<.*?>')

def remove_html_tags(text):
    return TAG_RE.sub('', text)

html = "<p>This is <b>bold</b> text</p>"
print(remove_html_tags(html))  # Output: This is bold text

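Performance tips #

Precompile patterns: use re.compile() for patterns that are reused many times
Use raw strings: prefix patterns with r so that backslashes are not treated as string escapes
Avoid excessive backtracking: be careful with .* and nested quantifiers
Use non-greedy quantifiers where they fit: *?, +? and similar change what is matched and can sometimes reduce backtracking
Consider string methods: for simple checks, prefer methods such as str.startswith()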
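Two of the tips above can be sketched briefly. A negated character class is a common alternative to a lazy `.*?` because it states directly where the match must stop, and a plain string method is enough for a fixed prefix check. The sample strings are illustrative, and no timing claims are made here.

```python
import re

html = '<a> <b>'

# Lazy quantifier: expands one character at a time until ">" is found.
lazy = re.findall(r'<.*?>', html)

# Negated class: matches anything that is not ">", so it cannot overrun the tag.
negated = re.findall(r'<[^>]*>', html)

# For a fixed prefix, a string method avoids the regex engine entirely.
with_regex = re.match(r'hello', 'hello world') is not None
with_method = 'hello world'.startswith('hello')

print(lazy, negated, with_regex, with_method)
```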

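Common problems #

1. Matching multi-line text #

text = """First line
Second line
Third line"""

# Without the MULTILINE flag: . does not match a newline and $ only matches at the end of the string
print(re.findall(r'^.*$', text))  # Output: []

# With the MULTILINE flag: ^ and $ match at the start and end of each line
print(re.findall(r'^.*$', text, re.MULTILINE))  # Output: ['First line', 'Second line', 'Third line']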
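A related flag is re.DOTALL, which lets `.` match newline characters so that a single pattern can span several lines. A minimal sketch with a two-line sample string:

```python
import re

text = 'First line\nSecond line'

# By default "." stops at the newline, so the pattern cannot reach "Second".
without_flag = re.search(r'First.*Second', text)

# With DOTALL, "." also matches "\n".
with_flag = re.search(r'First.*Second', text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group()))
```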

2. Case-insensitive matching #

print(re.search(r'hello', 'HELLO world', re.IGNORECASE).group())  # Output: HELLO

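3. Handling Unicode characters #

# re.UNICODE is the default for str patterns in Python 3, so the flag is optional here
print(re.findall(r'\w+', 'Привет мир', re.UNICODE))  # Output: ['Привет', 'мир']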
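To restrict `\w`, `\d`, and `\s` to ASCII characters only, the re.ASCII flag can be used. The sketch below contrasts the two modes on a mixed-script sample string chosen for illustration.

```python
import re

text = 'Привет world'

# Default for str patterns: \w matches Unicode letters too.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w is limited to [A-Za-z0-9_].
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)
print(ascii_words)
```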

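Summary #

Python's re module offers a full set of regular expression features, and knowing it well makes complex string matching and extraction tasks much easier. The key points are:

Understand the basic metacharacters and syntax
Know the main functions of the re module
Use groups and named groups
Be aware of the performance tips
Handle common edge cases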
