Regular expression to extract a section from a Wikipedia page [duplicate]

Asked by: 小点点

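I am trying to parse Wikipedia pages and need a regular expression to extract a specific part of the page. From the data below, I only need the contents of the {{Infobox…}} section.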

{{Infobox XC Championships
|Name       = Senior men's race at the 2008 IAAF World Cross Country Championships
|Host city  = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}
|Location   = [[Holyrood Park]]
|Nations participating  = 45
}}
2008.<ref name=iaaf_00>
{{ Citation 
| last = 
| publisher = [[IAAF]]
}}

So in the example above, I only need to extract:

Infobox XC Championships
|Name       = Senior men's race at the 2008 IAAF World Cross Country Championships
|Host city  = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}
|Location   = [[Holyrood Park]]
|Nations participating  = 45

Note that the {{Infobox…}} section can contain nested {{ }} templates, and I don't want to drop them.

Here is my regular expression:

\\{\\{Infobox[^{}]*\\}\\}

But it doesn't seem to work. Please help. Thanks!

Anonymous user

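Because of how the infobox section is formatted, you can actually use a regular expression for this.
The trick is that you don't even need to handle the nested {{…}} elements, because each of them sits on a line that starts with |.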

{{(Infobox.*\r\n(?:\|.*\r\n)+)}}

Debuggex Demo

{{           literal {{ that opens the template
  (Infobox   start of the capturing group
  .*\r\n     any characters until a line break appears
  (?:        
    \|       line has to start with a |
    .*\r\n   any characters until a line break appears
  )          
  +          the non-capturing group can occur multiple times
  )          end of capturing group
}}

So inside the Infobox section you only need to match lines that start with | until the }} shows up.

Depending on your platform or language you may have to experiment with \r\n. Debuggex is fine with \r\n, but regex101.com only matches on \n.
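
To try the pattern outside an online tester, here is a minimal Python sketch. It is an illustration under assumptions, not part of the original answer: the helper name extract_infobox and the shortened sample text are made up here, the braces are escaped so the pattern also works in stricter regex flavors, and \r?\n is used so that both \r\n and \n line endings match.

```python
import re

# Same idea as the pattern above: one "Infobox" header line, followed by
# one or more lines that start with "|", followed by the closing braces.
# \r?\n accepts both Windows (\r\n) and Unix (\n) line endings.
INFOBOX_RE = re.compile(r"\{\{(Infobox.*\r?\n(?:\|.*\r?\n)+)\}\}")


def extract_infobox(wikitext):
    """Return the text inside the first {{Infobox ...}} block, or None."""
    match = INFOBOX_RE.search(wikitext)
    return match.group(1) if match else None


# Shortened version of the sample from the question.
sample = (
    "{{Infobox XC Championships\n"
    "|Name       = Senior men's race\n"
    "|Host city  = [[Edinburgh]] {{flagicon|United Kingdom}}\n"
    "|Nations participating  = 45\n"
    "}}\n"
    "2008.<ref name=iaaf_00>\n"
    "{{ Citation \n"
    "| last = \n"
    "}}\n"
)

print(extract_infobox(sample))
```

The nested {{flagicon|United Kingdom}} is kept because it lies inside a line that starts with |. The pattern stops matching if a parameter value continues onto a line that does not start with |, so it only suits infoboxes laid out like the one in the question.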

Anonymous user

Don't use a regular expression… follow this algorithm

1

2

3

4
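
The steps of this answer are not preserved on this page, so the following is not the answerer's algorithm. It is only one common non-regex approach, sketched in Python under that assumption: find the opening {{Infobox, then walk forward counting {{ and }} until the depth returns to zero. The function name extract_template and its name parameter are hypothetical.

```python
def extract_template(wikitext, name="Infobox"):
    """Return the text between the braces of the first {{<name> ...}}
    template, keeping nested {{...}} templates intact, or None if no
    balanced template is found."""
    start = wikitext.find("{{" + name)
    if start == -1:
        return None

    depth = 0
    i = start
    # Stop one character early so wikitext[i:i + 2] always has two chars.
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Strip the outer "{{" and "}}" from the result.
                return wikitext[start + 2:i - 2]
        else:
            i += 1
    return None  # braces never balanced


text = (
    "{{Infobox XC Championships\n"
    "|Host city  = [[Edinburgh]] {{flagicon|United Kingdom}}\n"
    "|Nations participating  = 45\n"
    "}}\n"
    "{{ Citation \n"
    "| last = \n"
    "}}\n"
)

print(extract_template(text))
```

Unlike the line-based pattern, this does not depend on every parameter starting on its own line. It does not handle triple-brace parameters such as {{{1}}} or braces inside nowiki or comment sections, so a dedicated wikitext parser is the safer choice for arbitrary pages.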
