提问者:小点点

在DF中为每一行添加唯一组,包括来自其他列的总和


我得到了一个数据帧,看起来是这样的:

ID     field_1     area_1    field_2       area_2    field_3     area_3    field_4      area_4
1      scoccer     500       basketball    200       swimming    100       basketball   50
2      volleyball  100       np.nan        np.nan    np.nan      np.nan    np.nan       np.nan
3      basketball  1000      football      10        np.nan      np.nan    np.nan       np.nan
4      swimming    280       swimming      200       basketball  320       np.nan       np.nan
5      volleyball  110       football      160       volleyball  30        np.nan       np.nan 

原始数据帧具有相同的结构,但包含列field_1到field_30以及列area_1到Area_30。

我想根据“field_x”中不同的表达式,将具有水平组的列添加到DF中,并将相应的DIG区域相加。。。 添加的列应如下所示:

ID   group_1     area_1     group_2     area_2     group_3    area_3
        
1    scoccer     500        basketball  250        swimming   100
2    volleyball  100 
3    basketball  1000       football    10
4    swimming    480        basketball  320         
5    volleyball  140        football    160

有没有一个简单的方法来实现这一点?


共1个答案

匿名用户

使用pd.wide_to_long重塑数据框架,这允许您按字段和ID分组,并对区域进行求和。 然后,在使用CumCount创建列标签之后,PIVOT_TABLE返回到宽格式。

df = (pd.wide_to_long(df, i='ID', j='num', stubnames=['field', 'area'], sep='_')
        .groupby(['ID', 'field'])['area'].sum()
        .reset_index())
#   ID       field    area
#0   1  basketball   250.0
#1   1     scoccer   500.0
#2   1    swimming   100.0
#3   2  volleyball   100.0
#4   3  basketball  1000.0
#5   3    football    10.0
#6   4  basketball   320.0
#7   4    swimming   480.0
#8   5    football   160.0
#9   5  volleyball   140.0

df['idx'] = df.groupby('ID').cumcount()+1
df = (pd.pivot_table(df, index='ID', columns='idx', values=['field', 'area'], 
                     aggfunc='first')
        .sort_index(axis=1, level=1))
df.columns = ['_'.join(map(str, tup)) for tup in df.columns]
    area_1     field_1  area_2     field_2  area_3   field_3
ID                                                          
1    250.0  basketball   500.0     scoccer   100.0  swimming
2    100.0  volleyball     NaN         NaN     NaN       NaN
3   1000.0  basketball    10.0    football     NaN       NaN
4    320.0  basketball   480.0    swimming     NaN       NaN
5    160.0    football   140.0  volleyball     NaN       NaN

为了好玩,您可以使用未文档化的pd.lreshape而不是wide_to_long

# Change range to (1,31) for your real data.
pd.lreshape(df, {'area': [f'area_{i}' for i in range(1,5)],
                 'field': [f'field_{i}' for i in range(1,5)]}

#    ID    area       field
#0    1   500.0     scoccer
#1    2   100.0  volleyball
#2    3  1000.0  basketball
#3    4   280.0    swimming
#4    5   110.0  volleyball
#5    1   200.0  basketball
#....
#10   4   320.0  basketball
#11   5    30.0  volleyball
#12   1    50.0  basketball