Specific pandas columns as arguments in a new column of df.apply outputs

Asked by: 小点点

Given a pandas DataFrame as below:

import pandas as pd
from sklearn.metrics import mean_squared_error

df = pd.DataFrame.from_dict(
    {'row': ['a','b','c','d','e','y'],
     'a': [ 0, -.8,-.6,-.3, .8, .01],
     'b': [-.8,  0, .5, .7,-.9, .01],
     'c': [-.6, .5,  0, .3, .1, .01],
     'd': [-.3, .7, .3,  0, .2, .01],
     'e': [ .8,-.9, .1, .2,  0, .01],
     'y': [ .01, .01, .01, .01,  .01, 0],
    }).set_index('row')
df.columns.names = ['col']

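I want to create a new column of RMSE values (from scikit-learn) using specific columns as the arguments. That is, y_true = df[['a','b','c']] versus y_pred = df[['d','e','y']]. This is easy to do with an iterative approach: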

for tup in df.itertuples():
    df.at[tup[0], 'rmse']  = mean_squared_error(tup[1:4], tup[4:7])**0.5

This gives the desired result:

col     a     b     c     d     e     y      rmse
row                                              
a    0.00 -0.80 -0.60 -0.30  0.80  0.01  1.003677
b   -0.80  0.00  0.50  0.70 -0.90  0.01  1.048825
c   -0.60  0.50  0.00  0.30  0.10  0.01  0.568653
d   -0.30  0.70  0.30  0.00  0.20  0.01  0.375988
e    0.80 -0.90  0.10  0.20  0.00  0.01  0.626658
y    0.01  0.01  0.01  0.01  0.01  0.00  0.005774

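But I want a more performant solution, possibly using vectorization, since my DataFrame has shape (180000000, 52). I also dislike indexing by tuple position rather than by column name. The following attempt: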

df['rmse'] = df.apply(mean_squared_error(df[['a','b','c']], df[['d','e','y']])**0.5, axis=1)

raises the error:

TypeError: ("'numpy.float64' object is not callable", 'occurred at index a')

So what am I doing wrong with df.apply()? Is it even the best way to maximize performance over iteration?

I have measured wall times for the approaches from the first two responders using the test df below:

import numpy as np
from random import shuffle

# set up test df
dim_x, dim_y = 50, 1000000
cols = ["a_"+str(i) for i in range(1,(dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1,(dim_x//2)+1)]
cols.extend(cols_b)
shuffle(cols)
df = pd.DataFrame(np.random.uniform(0,10,[dim_y, dim_x]), columns=cols)  #, index=idx, columns=cols
a = df.values

# define column samples
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

c0 = [s for s in cols if "a" in s]
c1 = [s for s in cols if "b" in s]
s0 = a[:,column_index(df, c0)]
s1 = a[:,column_index(df, c1)]

The results are as follows:

%%time
# approach 1 - divakar
rmse_out = np.sqrt(((s0 - s1)**2).mean(1))
df['rmse_out'] = rmse_out

Wall time: 393 ms

%%time
# approach 2 - divakar
diffs = s0 - s1
# divide by the number of columns in each block (25 for this test df, not 3)
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/diffs.shape[1])
df['rmse_out'] = rmse_out

Wall time: 228 ms

%%time
# approach 3 - divakar
diffs = s0 - s1
# divide by the number of columns in each block (25 for this test df, not 3)
rmse_out = np.sqrt((np.einsum('ij,ij->i',s0,s0) + \
         np.einsum('ij,ij->i',s1,s1) - \
       2*np.einsum('ij,ij->i',s0,s1))/s0.shape[1])
df['rmse_out'] = rmse_out

Wall time: 421 ms

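The solution using the apply function was still running after several minutes...

2 answers

Anonymous user

Approach #1

One way to improve performance is to work on the underlying array data with NumPy ufuncs, slicing out the two blocks of columns so that the ufuncs are applied in a vectorized manner, like so: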

a = df.values
rmse_out = np.sqrt(((a[:,0:3] - a[:,3:6])**2).mean(1))
df['rmse_out'] = rmse_out

Approach #2

An alternative, faster way to compute the RMSE values uses np.einsum in place of the squared summation:

diffs = a[:,0:3] - a[:,3:6]
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
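
As a minimal sketch of what the einsum subscript string is doing, the following uses a small made-up array (the values are illustrative only, not from the question) and compares the result against the plain ufunc version:

```python
import numpy as np

# Small made-up array: two rows of differences, three columns each.
diffs = np.array([[1.0, 2.0, 3.0],
                  [0.5, -0.5, 0.0]])

# 'ij,ij->i' multiplies the two inputs element-wise and sums over j
# (the columns), leaving one value per row i.
row_sq_sums = np.einsum('ij,ij->i', diffs, diffs)

# The same quantity written with ordinary ufuncs.
reference = (diffs ** 2).sum(axis=1)

# Dividing by the number of columns gives the mean; sqrt gives the RMSE.
rmse = np.sqrt(row_sq_sums / diffs.shape[1])
```

Using diffs.shape[1] rather than a hard-coded 3.0 keeps the divisor correct if the number of columns per block changes.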

Approach #3

Another way to compute rmse_out uses the formula:

(a - b)^2 = a^2 + b^2 - 2ab

First, extract the slices:

s0 = a[:,0:3]
s1 = a[:,3:6]

Then rmse_out would be:

np.sqrt(((s0**2).sum(1) + (s1**2).sum(1) - (2*s0*s1).sum(1))/3.0)

With einsum, this becomes:

np.sqrt((np.einsum('ij,ij->i',s0,s0) + \
         np.einsum('ij,ij->i',s1,s1) - \
       2*np.einsum('ij,ij->i',s0,s1))/3.0)
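
One caveat with the expanded form is floating-point cancellation: when a row of s0 is nearly equal to the matching row of s1, the three terms can sum to a tiny negative number, and the square root then returns nan. The sketch below checks the expanded form against the direct one on random data and clips at zero as a guard. The seed, the array sizes and the clipping step are assumptions of this example, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen arbitrarily for this sketch
s0 = rng.uniform(0, 10, size=(1000, 3))
s1 = rng.uniform(0, 10, size=(1000, 3))
n = s0.shape[1]                          # number of columns in each block

# Direct form: sqrt(mean((s0 - s1)^2)) per row.
direct = np.sqrt(((s0 - s1) ** 2).mean(axis=1))

# Expanded form: (sum(s0^2) + sum(s1^2) - 2*sum(s0*s1)) / n per row.
expanded_sq = (np.einsum('ij,ij->i', s0, s0)
               + np.einsum('ij,ij->i', s1, s1)
               - 2 * np.einsum('ij,ij->i', s0, s1)) / n

# Rounding can leave a tiny negative value when the rows nearly coincide,
# so clip at zero before taking the square root.
expanded = np.sqrt(np.maximum(expanded_sq, 0.0))

agree = np.allclose(direct, expanded)
```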

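Getting the respective column indices

If you are not sure whether columns a, b, ... are in that order, you can find their positions with column_index (defined in the question).

Thus a[:,0:3] would be replaced by a[:,column_index(df, ['a','b','c'])] and a[:,3:6] by a[:,column_index(df, ['d','e','y'])].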
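
Putting this together, here is a sketch of Approach #1 selecting by column name on the question's sample data. The columns are deliberately reordered to show that positions no longer matter. The reordering and the comparison with pandas' Index.get_indexer are additions made for this example.

```python
import numpy as np
import pandas as pd

def column_index(df, query_cols):
    # Positions of query_cols within df.columns, in the order requested.
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

df = pd.DataFrame(
    {'a': [0, -.8, -.6, -.3, .8, .01],
     'b': [-.8, 0, .5, .7, -.9, .01],
     'c': [-.6, .5, 0, .3, .1, .01],
     'd': [-.3, .7, .3, 0, .2, .01],
     'e': [.8, -.9, .1, .2, 0, .01],
     'y': [.01, .01, .01, .01, .01, 0]},
    index=pd.Index(list('abcdey'), name='row'))

# Scramble the column order so positional slices like a[:, 0:3] would be wrong.
df = df[['y', 'c', 'a', 'e', 'b', 'd']]

a = df.values
idx_true = column_index(df, ['a', 'b', 'c'])
idx_pred = column_index(df, ['d', 'e', 'y'])

# pandas' own Index.get_indexer gives the same positions.
same_as_pandas = np.array_equal(idx_true, df.columns.get_indexer(['a', 'b', 'c']))

s0 = a[:, idx_true]
s1 = a[:, idx_pred]
df['rmse_out'] = np.sqrt(((s0 - s1) ** 2).mean(axis=1))
```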

Anonymous user

The df.apply approach:
df['rmse'] = df.apply(lambda x: mean_squared_error(x[['a','b','c']], x[['d','e','y']])**0.5, axis=1)

col     a     b     c     d     e     y      rmse
row                                              
a    0.00 -0.80 -0.60 -0.30  0.80  0.01  1.003677
b   -0.80  0.00  0.50  0.70 -0.90  0.01  1.048825
c   -0.60  0.50  0.00  0.30  0.10  0.01  0.568653
d   -0.30  0.70  0.30  0.00  0.20  0.01  0.375988
e    0.80 -0.90  0.10  0.20  0.00  0.01  0.626658
y    0.01  0.01  0.01  0.01  0.01  0.00  0.005774
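
The TypeError in the question arises because the expression passed to df.apply was evaluated before apply ever ran, so apply received a number rather than a function. The sketch below illustrates the difference. To stay self-contained it uses a small hypothetical rmse helper in place of mean_squared_error(...)**0.5, and only the first two rows of the sample data.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    # Hypothetical helper standing in for mean_squared_error(...) ** 0.5.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# First two rows of the question's sample data.
df = pd.DataFrame({'a': [0.0, -0.8], 'b': [-0.8, 0.0], 'c': [-0.6, 0.5],
                   'd': [-0.3, 0.7], 'e': [0.8, -0.9], 'y': [0.01, 0.01]},
                  index=['a', 'b'])

# Evaluated immediately over the whole frame: a single number, not a function.
eager = rmse(df[['a', 'b', 'c']], df[['d', 'e', 'y']])
print(callable(eager))  # False

# A lambda defers the call; apply invokes it once per row, passing a Series.
per_row = df.apply(lambda x: rmse(x[['a', 'b', 'c']], x[['d', 'e', 'y']]), axis=1)
```

Because the lambda is called once per row in Python, this form is far slower than the array-based approaches on large frames, which matches the timings reported in the question.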


		      