>>> import pandas as pd
>>> df = pd.DataFrame({'Sentence':['This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm', 'I have researched the product KEY_abc_def, and KEY_blt_chm as requested', 'He got the idea from your message KEY_mno_pqr']})
>>> df
Sentence
0 This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm
1 I have researched the product KEY_abc_def, and KEY_blt_chm as requested
2 He got the idea from your message KEY_mno_pqr
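I want to use a regular expression to extract the keys into a new column, without the literal KEY_ prefix. For sentences that contain more than one key, the keys should be joined with commas. The output should look like this: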
>>> df
Sentence KEY
0 This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm abc_def, mno_pqr, blt_chm
1 I have researched the product KEY_abc_def, and KEY_blt_chm as requested abc_def, blt_chm
2 He got the idea from your message KEY_mno_pqr mno_pqr
I tried the code below, but it does not do what I want: it only picks up the first key and ignores the rest. I am new to regex, so any suggestions would be appreciated.
df['KEY'] = df['Sentence'].str.extract(r"KEY_(\w+)", expand=True)
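Use
df['KEY'] = df['Sentence'].str.findall(r"KEY_(\w+)").str.join(",")
Series.str.findall finds every occurrence of the captured substring in each row and returns them as a list, and str.join(",") then joins each list into a single comma-separated string.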
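To make the difference between the two accessors concrete, here is a minimal sketch on a small made-up Series (the sample strings and the variable names `s`, `first_only`, `all_keys`, and `joined` are illustrative assumptions, not part of the question's data):

```python
import pandas as pd

# Hypothetical sample data, only for illustration
s = pd.Series([
    "first KEY_abc_def then KEY_mno_pqr",
    "only KEY_blt_chm here",
])

# str.extract keeps only the first match of the capture group per row
first_only = s.str.extract(r"KEY_(\w+)", expand=False)

# str.findall returns a list with every captured group per row
all_keys = s.str.findall(r"KEY_(\w+)")

# str.join collapses each per-row list into one string
joined = all_keys.str.join(",")

print(first_only.tolist())
print(all_keys.tolist())
print(joined.tolist())
```

The raw-string prefix (r"...") is used so that \w is passed to the regex engine unchanged; newer Python versions warn about the unescaped form.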
Test in pandas:
>>> df['KEY'] = df['Sentence'].str.findall(r"KEY_(\w+)").str.join(",")
>>> df
Sentence KEY
0 This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm abc_def,mno_pqr,blt_chm
1 I have researched the product KEY_abc_def, and KEY_blt_chm as requested abc_def,blt_chm
2 He got the idea from your message KEY_mno_pqr mno_pqr
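(Note, in case you did not know: I used pd.set_option('display.max_colwidth', None) to show the full contents of each column; see "How to display full (non-truncated) dataframe information in HTML when converting from pandas dataframe to html?")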
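The desired output in the question separates the keys with a comma followed by a space, whereas the test above joins with a bare comma. If the spacing matters, one possible variant is to pass ", " to str.join. The sketch below also uses pd.option_context as a scoped alternative to pd.set_option, which is my assumption about what is convenient here rather than something the answer requires:

```python
import pandas as pd

df = pd.DataFrame({'Sentence': [
    'This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm',
    'I have researched the product KEY_abc_def, and KEY_blt_chm as requested',
    'He got the idea from your message KEY_mno_pqr',
]})

# Collect every key per row, then join with ", " to match the requested format
df['KEY'] = df['Sentence'].str.findall(r"KEY_(\w+)").str.join(", ")

# Show untruncated columns only inside this block,
# leaving the global display options unchanged
with pd.option_context('display.max_colwidth', None):
    print(df)
```

Because \w does not match a comma, the trailing comma in "KEY_abc_def," in the second sentence is not included in the extracted key.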