Asked by: 小点点

Create a new column based on a list of text


For example, I have a list of sports:

sports = ["basketball", "football", "baseball"]

and a DataFrame with a single column containing some sentences:

column_1
df
My favourite sport is football
I love to play basketball
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal

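I want to go through the list and create a second column based on whether each sentence in the column contains one of these words. See below: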

df                                                    other
My favourite sport is football                        football
I love to play basketball                             basketball
Football is a family of team sports that involve..    football

I would rather not use if statements, because my list contains almost 50 different words. Thanks.


3 answers

Anonymous user

Try this, using str.extract:

import re

sports = ["basketball", "football", "baseball"]

pattern = "(%s)" % "|".join(sports)
# flags=re.IGNORECASE lets "Football" match "football"; expand=False returns a Series
df['extract'] = df['column_1'].str.extract(pattern, flags=re.IGNORECASE, expand=False)
0    football
1  basketball
2    Football
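For reference, here is a minimal self-contained sketch of the same str.extract approach. The sample DataFrame, the extra "chess" row, and the choice to lowercase the result are assumptions added for illustration; re.escape is included only in case some of the roughly 50 words contain regex metacharacters.

```python
import re
import pandas as pd

# Sample data from the question, plus one hypothetical row with no match
df = pd.DataFrame({"column_1": [
    "My favourite sport is football",
    "I love to play basketball",
    "Football is a family of team sports",
    "I prefer chess",
]})
sports = ["basketball", "football", "baseball"]

# Build one alternation pattern; re.escape protects any special characters
pattern = "(" + "|".join(re.escape(s) for s in sports) + ")"

# Case-insensitive match; expand=False returns a Series instead of a DataFrame
extracted = df["column_1"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Normalise the matched text to lower case; rows without a match stay missing
df["extract"] = extracted.str.lower()
print(df)
```

Rows that contain none of the words end up with a missing value, and only the first match in each sentence is kept.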

Anonymous user

import pandas as pd

df = pd.DataFrame()

df['column_1'] = ['My favourite sport is football', 'I love to play basketball', 'Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal']

sports = ["basketball", "football", "baseball"]

list_output = []

for i in range(len(df)):
    sentence = df['column_1'].iloc[i]
    match = None  # stays None when no sport appears in the sentence
    for s in sports:
        # compare everything in lower case so "Football" still matches "football"
        if s.lower() in sentence.lower():
            match = s
            break  # keep only the first match so each row gets exactly one value
    list_output.append(match)

df['sport'] = list_output

Anonymous user

Use this. It is straightforward and easy to understand:

df['other'] = df['column1'].apply(lambda x: list(set(x.lower().split()).intersection(set(sports)))[0])
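  1. This applies a function that first lowercases the sentence and then splits it into words.
  2. It then takes the intersection of the set of words in the sentence and the set of words in the sports list.
  3. If a sentence can contain more than one sport, drop the trailing [0] to get a list of sports instead.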
    column1                         other
0   My favourite sport is football  football
1   I love to play basketball       basketball
2   Football is a family of t...    football
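The one-liner above raises an IndexError when a sentence contains none of the sports, and split() leaves punctuation attached to words (so "football," would not match). A possible variation on the same set-intersection idea is sketched below. The helper name find_sports, the all_sports column, and the extra sample row are hypothetical, and tokenising with a simple [a-z]+ regex is an assumption that suits plain English words only.

```python
import re
import pandas as pd

sports = ["basketball", "football", "baseball"]
sports_set = set(sports)

def find_sports(sentence):
    """Return the sports mentioned in a sentence, as a sorted list."""
    # Lowercase, then keep only runs of letters so punctuation is ignored
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    # Intersect with the sports set; sorted() makes the order deterministic
    return sorted(words & sports_set)

df = pd.DataFrame({"column1": [
    "My favourite sport is football",
    "I love to play basketball",
    "Football, basketball and chess",  # hypothetical row with two sports
    "I prefer chess",                  # hypothetical row with no sport
]})

# Full list of matches per row
df["all_sports"] = df["column1"].apply(find_sports)

# First match only, or None when the list is empty
df["other"] = df["all_sports"].apply(lambda m: m[0] if m else None)
print(df)
```

Note that "first" here means first in alphabetical order, not first in the sentence, because sets do not preserve word order.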