Asked by: 小点点

Regular expression to capture a multi-line body of text


So I have some text documents that look like this:

1a  Title
        Subtitle
            Description
1b  Title
        Subtitle A
            Description
        Subtitle B
            Description
2   Title
        Subtitle A
            Description
        Subtitle B
            Description
        Subtitle C
            Description

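I'm trying to use a regex to capture the "Description" lines, which are indented by 3 tabs. The problem I'm having is that a description sometimes wraps onto the next line, which is again indented by 3 tabs. Here is an example: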

1   Demo
        Example
            This is the description text body that I am
            trying to capture with regex.

I'd like to capture this text as a single group, ending up with:

This is the description text body that I am trying to capture with regex.

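Once I can do that, I'd also like to "flatten" the document so that each section sits on a single line, separated by a delimiter instead of by newlines and tabs. So my example would become: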

1->Demo->->Example->->->This is the description text...

I'll be implementing this in Python, but any regex guidance would be appreciated!


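UPDATE
I've changed the delimiters in the flattened text so that they indicate the previous nesting relationship.

Also, if a title (section) has multiple subtitles (subsections), this is what the flattened text would look like: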

1a-

Basically, it just "reuses" the parent (number/title) for each child (subtitle).


3 answers

Anonymous user

You can do this without a regex:

txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
\t\tSep
\t\t\tAnd Another Section
\t\t\tOn two lines
'''

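cap = []
buf = []
for line in txt.splitlines():
    if line.startswith('\t\t\t'):
        # Description line: collect it until the block ends
        buf.append(line.strip())
        continue
    if buf:
        # A shallower line closes the current description block
        cap.append(' '.join(buf))
        buf = []
if buf:
    # Flush a description block that runs to the end of the text
    cap.append(' '.join(buf))

print(cap)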

Prints:

['This is the description text body that I am trying to capture with regex.', 
 'And Another Section On two lines']

The advantage is that separate blocks of 3-tab-indented text stay separate instead of being merged into one.

OK: here is a complete solution using regex:

txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
2\tSecond Demo
\t\tAnother Section
\t\t\tAnd Another 3rd level Section
\t\t\tOn two lines
3\tNo section below
4\tOnly one level below
\t\tThis is that one level
'''

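import re

result = []
# Each match is one top-level section: from a line starting with a digit
# up to the next such line (or the end of the text).
for ms in re.finditer(r'^(\d+.*?)(?=^\d|\Z)', txt, re.S | re.M):
    section = ms.group(1)
    # Deepest tab indentation in this section. A list (not map) is used so
    # the emptiness check below also works on Python 3.
    tm = [len(t) for t in re.findall(r'^\t+', section, re.M)]
    subsections = max(tm) if tm else 0
    sec = [re.search(r'(^\d+.*)', section).group(1)]
    for i in range(2, subsections + 1):
        # Lines indented by exactly i tabs
        lt = r'^{}([^\t]+)$'.format(r'\t' * i)
        level = re.findall(lt, section, re.M)
        sec.append(' '.join(s.strip() for s in level))

    result.append('->'.join(sec))
    print(result[-1])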

Prints:

1   Demo->Example->This is the description text body that I am trying to capture with regex.
2   Second Demo->Another Section->And Another 3rd level Section On two lines
3   No section below
4   Only one level below->This is that one level

Limitations:

1) This is limited to the format you described.
2) It will not handle a return to an earlier level properly:

    1 Section
         Second Level
             Third Level
         Second Level Again       <== This would be jammed in with the first 'Second Level'

    How would you handle multiple levels?

3) Won't handle multiline section headers:

    3    Like
         This

Running this on your full example:

1a  Title->Subtitle->Description Second Line of Description
1b  Title->Subtitle A Subtitle B->Description Description
2   Title->Subtitle A Subtitle B Subtitle C->Description Description Description

You can see that the second and third levels are each joined together, but I don't know how you want that case to be handled.
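If the goal is what the question's update describes (repeat the parent number/title for every subtitle), a line-by-line pass may be easier than a single regex. The following is only a sketch: the function name `flatten`, the `->` delimiter, and the exact shape of the output are assumptions, since the flattened example in the update is cut off. It assumes titles have no indentation, subtitles have exactly 2 tabs, and descriptions have 3 or more.

```python
def flatten(text, sep="->"):
    """Return one flattened line per subtitle, repeating the parent title.

    Hypothetical helper: the delimiter and output layout are assumptions.
    """
    records = []  # each record: [head, subtitle or None, [description lines]]
    head = None
    for line in text.splitlines():
        content = line.strip()
        if not content:
            continue
        depth = len(line) - len(line.lstrip("\t"))  # number of leading tabs
        if depth == 0:
            # "1a<TAB>Title" becomes "1a->Title"
            head = sep.join(content.split(None, 1))
            records.append([head, None, []])
        elif depth == 2:
            if records and records[-1][1] is None:
                # First subtitle fills the record created for the title
                records[-1][1] = content
            else:
                # Later subtitles reuse the same parent head
                records.append([head, content, []])
        elif depth >= 3 and records:
            # Wrapped description lines accumulate on the current record
            records[-1][2].append(content)

    lines = []
    for rec_head, subtitle, desc in records:
        parts = [p for p in (rec_head, subtitle) if p is not None]
        if desc:
            parts.append(" ".join(desc))
        lines.append(sep.join(parts))
    return lines


doc = (
    "1a\tTitle\n"
    "\t\tSubtitle\n"
    "\t\t\tDescription\n"
    "\t\t\tSecond Line of Description\n"
    "1b\tTitle\n"
    "\t\tSubtitle A\n"
    "\t\t\tDescription\n"
    "\t\tSubtitle B\n"
    "\t\t\tDescription\n"
)

for flat in flatten(doc):
    print(flat)
```

Because each subtitle starts its own record, a second-level line that follows a third-level block is no longer merged into the earlier subtitle.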

Anonymous user

How about this?

re.findall(r'(?m)((?:^\t{3}.*?\n)+)', doc)

It will also capture the tabs and newlines, but those can be removed afterwards.
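One possible way to do that cleanup, as a rough sketch: `str.split()` with no arguments splits on any run of whitespace, so joining the pieces with a single space drops the tabs and newlines in one step. The sample `doc` and the name `descriptions` are made up for illustration.

```python
import re

# Sample input (an assumption for this sketch), using real tab characters
doc = (
    "1\tDemo\n"
    "\t\tExample\n"
    "\t\t\tThis is the description text body that I am\n"
    "\t\t\ttrying to capture with regex.\n"
    "\t\tSep\n"
    "\t\t\tAnd Another Section\n"
    "\t\t\tOn two lines\n"
)

# Each match is one run of consecutive lines indented by 3 tabs
blocks = re.findall(r'(?m)((?:^\t{3}.*?\n)+)', doc)

# Collapse every run of whitespace (tabs, newlines, spaces) to one space
descriptions = [" ".join(block.split()) for block in blocks]

print(descriptions)
```

Note that this also collapses repeated spaces inside a description, and that the pattern needs a newline after every line, so a final description line without a trailing newline would not be captured.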

Anonymous user

Using re in Python 2:

import re

text = "yourtexthere"
lines = re.findall("\t{3}.+", text)

Without the leading tabs ("\t"):

text = "yourtexthere"
lines = [i[3:] for i in re.findall("\t{3}.+", text)]

To get the final output:

...
"\n".join(lines)

It's not great yet, but I'm working on it:

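import re

text = "..."
# Treat four spaces as one tab, then grab each run of 2 or 3 tabs
# together with the rest of its line
out = re.findall("\t{2,3}.+", text.replace("    ", "\t"))
fixed = []
sub = []
for i in out:
    if not i.startswith("\t" * 3):
        # A 2-tab (subtitle) line ends the current run of description lines
        if sub:
            fixed.append(tuple(sub))
            sub = []
    else:
        sub.append(i)
if sub:
    fixed.append(tuple(sub))
print(fixed)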