I have some text documents that look like this:
1a	Title
		Subtitle
			Description
1b	Title
		Subtitle A
			Description
		Subtitle B
			Description
2	Title
		Subtitle A
			Description
		Subtitle B
			Description
		Subtitle C
			Description
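I'm trying to use a regex to capture the "Description" lines, which are indented by 3 tabs. The problem I'm running into is that a description sometimes wraps onto the next line, which is again indented by 3 tabs. Here is an example: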
1	Demo
		Example
			This is the description text body that I am
			trying to capture with regex.
I'd like to capture this text as a single group, ending up with:
This is the description text body that I am trying to capture with regex.
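Once I'm able to do that, I'd also like to "flatten" the document so that each section sits on a single line, delimited by characters instead of newlines and tabs. So my example would become: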
1->Demo->->Example->->->This is the description text...
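I'll be implementing this in Python, but any regex guidance would be greatly appreciated!
UPDATE
I've changed the delimiters in the flattened text to show the relationships the indentation used to express.
Also, if a title (section) has multiple subtitles (subsections), the flattened text would look like this: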
1a-
Basically it just "reuses" the parent (number/title) for each child (subtitle).
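You can do this without a regex:
txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
\t\tSep
\t\t\tAnd Another Section
\t\t\tOn two lines
'''
cap=[]   # finished descriptions
buf=[]   # lines of the description currently being collected
for line in txt.splitlines():
    if line.startswith('\t\t\t'):
        # description line: collect it and move on
        buf.append(line.strip())
        continue
    if buf:
        # any other line ends the current description
        cap.append(' '.join(buf))
        buf=[]
else:
    # loop finished: flush a description that ran to the end of the text
    if buf:
        cap.append(' '.join(buf))

print(cap)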
Prints:
['This is the description text body that I am trying to capture with regex.',
'And Another Section On two lines']
The advantage is that separate sections that are each indented with 3 tabs remain separable.
OK: here is a complete solution using regex:
txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
2\tSecond Demo
\t\tAnother Section
\t\t\tAnd Another 3rd level Section
\t\t\tOn two lines
3\tNo section below
4\tOnly one level below
\t\tThis is that one level
'''
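import re

# split the text into sections, each starting at a line that begins with a digit
for ms in re.finditer(r'^(\d+.*?)(?=^\d|\Z)',txt,re.S | re.M):
    section=ms.group(1)
    # deepest tab indentation in this section (a list, so this also works on Python 3)
    tm=[len(t) for t in re.findall(r'(^\t+)', section, re.S | re.M)]
    subsections=max(tm) if tm else 0
    # the first line of the section is its header
    sec=[re.search(r'(^\d+.*)', section).group(1)]
    if subsections:
        for i in range(2,subsections+1):
            # all lines indented by exactly i tabs, joined into one string
            lt=r'^{}([^\t]+)$'.format(r'\t'*i)
            level=re.findall(lt, section, re.M)
            sec.append(' '.join(s.strip() for s in level))

    print('->'.join(sec))
Prints: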
1 Demo->Example->This is the description text body that I am trying to capture with regex.
2 Second Demo->Another Section->And Another 3rd level Section On two lines
3 No section below
4 Only one level below->This is that one level
1) This is limited to the format you described.
2) It will not handle reverse levels properly:
1	Section
		Second Level
			Third Level
		Second Level Again    <== This would be jammed in with 'Second Level'
How would you want to handle multiple levels?
3) It won't handle multiline section headers:
3	Like
	This
Running this on your full example:
1a Title->Subtitle->Description Second Line of Description
1b Title->Subtitle A Subtitle B->Description Description
2 Title->Subtitle A Subtitle B Subtitle C->Description Description Description
You can see that the second and third levels are joined, but I don't know how you want that format handled.
How about this?
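If what you want is the "reuse the parent for each child" layout from your update, one option is to drop the per-level regex and walk the lines by tab depth, which also copes with the reverse-level case from caveat 2. This is only a minimal sketch under assumptions: titles have no leading tab, subtitles have exactly two, description lines have three or more, and a wrapped description repeats the same indent. `flatten` and its `sep` parameter are hypothetical names of mine, not anything from the question.

```python
def flatten(text, sep='->'):
    """Return one flattened line per subtitle, repeating its parent title."""
    rows = []  # each row is [title], [title, subtitle] or [title, subtitle, description]
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        depth = len(line) - len(line.lstrip('\t'))  # number of leading tabs
        content = ' '.join(line.split())            # "1b\tTitle" becomes "1b Title"
        if depth == 0:
            rows.append([content])                  # new title, no subtitle yet
        elif depth == 2 and rows:
            if len(rows[-1]) == 1:
                rows[-1].append(content)            # first subtitle under this title
            else:
                rows.append([rows[-1][0], content])  # later subtitle: reuse the title
        elif depth >= 3 and rows and len(rows[-1]) >= 2:
            if len(rows[-1]) == 2:
                rows[-1].append(content)            # first description line
            else:
                rows[-1][2] += ' ' + content        # wrapped line: join with a space
        # anything else (e.g. a single-tab line) is ignored in this sketch
    return [sep.join(row) for row in rows]


sample = (
    "1b\tTitle\n"
    "\t\tSubtitle A\n"
    "\t\t\tDescription that wraps\n"
    "\t\t\tonto a second line\n"
    "\t\tSubtitle B\n"
    "\t\t\tDescription\n"
    "3\tNo section below\n"
)
for row in flatten(sample):
    print(row)
# 1b Title->Subtitle A->Description that wraps onto a second line
# 1b Title->Subtitle B->Description
# 3 No section below
```

Multiline section headers (caveat 3) are still not handled; the sketch simply skips lines whose depth it does not recognise.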
re.findall(r'(?m)((?:^\t{3}.*?\n)+)', doc)
It will also capture the tabs and newlines, but those can be removed later.
Using re
Python 2:
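As a rough sketch of that clean-up step: splitting each captured block on whitespace and re-joining with single spaces drops the tabs and newlines and merges wrapped lines in one go. The sample `doc` below is made up for illustration, and the pattern only sees description lines that end with a newline, so a final line without a trailing `\n` would be missed.

```python
import re

# Hypothetical sample input, tab-indented as in the question.
doc = (
    "1\tDemo\n"
    "\t\tExample\n"
    "\t\t\tThis is the description text body that I am\n"
    "\t\t\ttrying to capture with regex.\n"
    "\t\tSep\n"
    "\t\t\tAnd Another Section\n"
    "\t\t\tOn two lines\n"
)

# Each match is a run of consecutive lines that start with three tabs.
blocks = re.findall(r'(?m)((?:^\t{3}.*?\n)+)', doc)

# str.split() with no argument splits on any whitespace, so tabs and
# newlines disappear and wrapped lines are joined with a single space.
descriptions = [' '.join(block.split()) for block in blocks]
print(descriptions)
```

Note that this also collapses any run of spaces inside a description into one space, which may or may not be what you want.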
text = "yourtexthere"
lines = re.findall("\t{3}.+", text)
Without the tabs ("\t"):
text = "yourtexthere"
lines = [i[3:] for i in re.findall("\t{3}.+", text)]
To get the final output:
...
"\n".join(lines)
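Not great yet, but I'm working on it:
import re
text = "..."
# Assumption: the original argument was a run of four spaces (the whitespace was
# collapsed in this copy), i.e. space indentation is normalised to tabs first.
out = re.findall("\t{2,3}.+", text.replace("    ", "\t"))
fixed = []   # one tuple of description lines per subsection
sub = []     # description lines of the current subsection
for i in out:
    if not i.startswith("\t"*3):
        # a two-tab line starts a new subsection: flush what was collected
        if sub:
            fixed.append(tuple(sub))
            sub = []
    else:
        sub.append(i)
if sub:
    fixed.append(tuple(sub))
print(fixed)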