Python regex: a findall pattern that matches a string against a tricky specification and returns the result as a list of words

Asked by: 小点点


I have a string:

sample_input = """
This film is based on Isabel Allende's not-so-much-better novel. I hate Meryl
Streep and Antonio Banderas (in non-Spanish films), and the other actors,
including Winona, my favourite actress and Jeremy Irons try hard to get over
such a terrible script."""

I want to apply a regular expression to it so that it produces the desired output:

['this', 'film', 'is', 'based', 'on', 'isabel', "allende's", 'not-so', 'much-better', 'novel', 'i', 'hate', 'meryl', 'streep', 'and', 'antonio', 'banderas', 'in', 'non-spanish', 'films', 'and', 'the', 'other', 'actors', 'including', 'winona', 'my', 'favourite', 'actress', 'and', 'jeremy', 'irons', 'try', 'hard', 'to', 'get', 'over', 'such', 'a', 'terrible', 'script']

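I want to build a list of words (all lowercase) following these rules:

1. A word must start and end with a single letter or digit.
2. A word may contain only one hyphen (-) or one apostrophe (').
3. If rule 1 or rule 2 is violated, a new word starts.

**See the desired output for details.

Note that the regex should allow a hyphen or an apostrophe inside a word, but at most one such character per word.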

I tried the following code:

sample_output_regex = re.findall(r'[a-zA-Z0-9]*[-]?|[\']?[a-zA-Z0-9]*', sample_input.lower())

But the output is quite poor:

['', 'this', '', 'film', '', 'is', '', 'based', '', 'on', '', 'isabel', '', 'allende', '', "'s", '', 'not-', 'so-', 'much-', 'better', '', 'novel', '', '', 'i', '', 'hate', '', 'meryl', '', 'streep', '', 'and', '', 'antonio', '', 'banderas', '', '', 'in', '', 'non-', 'spanish', '', 'films', '', '', '', 'and', '', 'the', '', 'other', '', 'actors', '', '', 'including', '', 'winona', '', '', 'my', '', 'favourite', '', 'actress', '', 'and', '', 'jeremy', '', 'irons', '', 'try', '', 'hard', '', 'to', '', 'get', '', 'over', '', 'such', '', 'a', '', 'terrible', '', 'script', '', '', '']

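To get better at regular expressions, I would like to know where my regex goes wrong and how to change it to get the output I want. A detailed explanation would be appreciated. For example, why do spaces come through as '' when my regex does not ask to match spaces?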

1 answer

Anonymous user

About the pattern:

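You get empty entries because every part of the pattern [a-zA-Z0-9]*[-]?|[\']?[a-zA-Z0-9]* is optional, so it can match an empty string at positions such as spaces and punctuation.

Because of the alternation |, a word such as not-so will not be a single match: the left branch stops right after the -, and the part after the - is then matched separately.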
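Both effects can be reproduced on a short string. This is only a small sketch: the names `question_pattern` and `matches` are chosen here for illustration and are not part of the original post, and the output comment assumes Python 3.7 or later.

```python
import re

# The pattern from the question; every piece of it may match zero characters.
question_pattern = r"[a-zA-Z0-9]*[-]?|[\']?[a-zA-Z0-9]*"

matches = re.findall(question_pattern, "not-so good")
print(matches)  # ['not-', 'so', '', 'good', '']

# The '' entries come from zero-length matches at the space and at the end
# of the string, and "not-so" is split because the left branch of the
# alternation ends right after the hyphen.
```

Filtering out the empty strings afterwards would hide the first problem but would not join `not-` and `so`, which is why the pattern itself needs to change.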

You could use the following pattern instead:

\b[a-zA-Z0-9]+(?:[-'][a-zA-Z0-9]+)?\b

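The pattern matches:

\b  a word boundary
[a-zA-Z0-9]+  one or more characters from the listed ranges

(?:  start of a non-capturing group

[-'][a-zA-Z0-9]+  a single - or ' followed by one or more characters from the listed ranges

)?  end of the group, which is optional

\b  a word boundary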
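The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace in the pattern and allows comments. This is a sketch of one possible way to write it; the name `WORD_RE` is an assumption made here, not something from the answer.

```python
import re

# Hypothetical name; the pattern is the one from the answer, spelled out.
WORD_RE = re.compile(
    r"""
    \b                  # word boundary
    [a-zA-Z0-9]+        # one or more letters or digits
    (?:                 # start of a non-capturing group
        [-']            #   a single hyphen or apostrophe
        [a-zA-Z0-9]+    #   followed by one or more letters or digits
    )?                  # the whole group is optional (zero or one time)
    \b                  # word boundary
    """,
    re.VERBOSE,
)

words = WORD_RE.findall("Allende's not-so-much-better novel")
print(words)  # ["Allende's", 'not-so', 'much-better', 'novel']
```

Because the group can occur at most once, the second hyphen in not-so-much-better cannot be part of the first match, so a new word starts after it.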


You can then convert all matches to lowercase.

import re

sample_input = """
This film is based on Isabel Allende's not-so-much-better novel. I hate Meryl
Streep and Antonio Banderas (in non-Spanish films), and the other actors,
including Winona, my favourite actress and Jeremy Irons try hard to get over
such a terrible script."""

res = [x.lower() for x in re.findall(r"\b[a-zA-Z0-9]+(?:[-'][a-zA-Z0-9]+)?\b", sample_input)]
print(res)

Output

['this', 'film', 'is', 'based', 'on', 'isabel', "allende's", 'not-so', 'much-better', 'novel', 'i', 'hate', 'meryl', 'streep', 'and', 'antonio', 'banderas', 'in', 'non-spanish', 'films', 'and', 'the', 'other', 'actors', 'including', 'winona', 'my', 'favourite', 'actress', 'and', 'jeremy', 'irons', 'try', 'hard', 'to', 'get', 'over', 'such', 'a', 'terrible', 'script']
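To see how the pattern behaves when a word breaks the rules from the question, it may help to wrap it in a small function and try a few edge cases. The helper name `tokenize` and the test strings below are made up for this sketch; only the pattern comes from the answer.

```python
import re

WORD_PATTERN = r"\b[a-zA-Z0-9]+(?:[-'][a-zA-Z0-9]+)?\b"


def tokenize(text):
    """Return the lowercase words found by the answer's pattern."""
    return [word.lower() for word in re.findall(WORD_PATTERN, text)]


# A second apostrophe starts a new word (rule 2, then rule 3).
print(tokenize("Rock'n'Roll"))   # ["rock'n", 'roll']

# A doubled hyphen cannot sit inside a word, so the word is split there.
print(tokenize("well--known"))   # ['well', 'known']

# Leading and trailing quotes are dropped (rule 1).
print(tokenize("'quoted'"))      # ['quoted']
```

One caveat worth keeping in mind: `\b` treats the underscore as a word character, so text containing underscores may not be split the way the rules suggest. The sample input has none, so the result above is not affected.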
