Suppose I have a DataFrame with the following columns: date, time, day, month, year, description, price, type, manufacturer.
Using pandas and value_counts(), I can get the count of each unique item in a column:
df.manufacturer.value_counts()
Also, using groupby, I can get the mean price for each day in my data:
df.groupby("day").price.mean()
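The problem is that there are 7 days in total, but my data may contain only 5 or 6 of them, so I need to add the missing days with a mean of zero or None.
In general, if I have a specific list of expected values, how do I include the missing items when I run value_counts or a groupby operation?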
I think you can convert the day column to a Categorical, so that the groupby mean returns NaN for the missing categories:
import pandas as pd

df = pd.DataFrame({
    'day': ['Monday','Tuesday','Tuesday','Tuesday','Thursday'],
    'price': list(range(5))
})
print(df)
        day  price
0    Monday      0
1   Tuesday      1
2   Tuesday      2
3   Tuesday      3
4  Thursday      4
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['day'] = pd.Categorical(df['day'], categories=cats, ordered=True)
# observed=False keeps categories that have no rows in the output
print(df.groupby("day", as_index=False, observed=False).price.mean())
         day  price
0     Monday    0.0
1    Tuesday    2.0
2  Wednesday    NaN
3   Thursday    4.0
4     Friday    NaN
5   Saturday    NaN
6     Sunday    NaN
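The same categorical conversion should also cover the value_counts part of the question. A minimal sketch, assuming the sample frame above; the variable name counts is only for illustration:

```python
import pandas as pd

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df = pd.DataFrame({
    'day': ['Monday', 'Tuesday', 'Tuesday', 'Tuesday', 'Thursday'],
    'price': list(range(5))
})
df['day'] = pd.Categorical(df['day'], categories=cats, ordered=True)

# value_counts on a categorical column reports every category,
# including those with no rows (count 0);
# sort_index puts the result back in category order
counts = df['day'].value_counts().sort_index()
print(counts)
```

Days that never occur in the data show up with a count of 0 instead of being dropped.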
Another solution is to leave the day column as plain strings and reindex the result by all possible categories:
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
print(df.groupby("day").price.mean().reindex(cats))
day
Monday 0.0
Tuesday 2.0
Wednesday NaN
Thursday 4.0
Friday NaN
Saturday NaN
Sunday NaN
Name: price, dtype: float64
print(df.groupby("day").price.mean().reindex(cats, fill_value=0))
day
Monday 0
Tuesday 2
Wednesday 0
Thursday 4
Friday 0
Saturday 0
Sunday 0
Name: price, dtype: int64
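The reindex approach can be applied to value_counts in the same way. A small sketch, assuming the day column is still plain strings; the name counts is only for illustration:

```python
import pandas as pd

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df = pd.DataFrame({
    'day': ['Monday', 'Tuesday', 'Tuesday', 'Tuesday', 'Thursday'],
    'price': list(range(5))
})

# count the days that are present, then reindex against the full list;
# labels missing from the counts are added with fill_value
counts = df['day'].value_counts().reindex(cats, fill_value=0)
print(counts)
```

Note that fill_value only applies to labels that reindex has to add; it does not replace NaN values that are already in the result.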
You should use categorical data for this. Below is a minimal example.
import pandas as pd
df = pd.DataFrame([['Monday', 5], ['Monday', 6], ['Monday', 3],
                   ['Tuesday', 1], ['Tuesday', 8]],
                  columns=['day', 'value'])
# list days in order
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# convert to categorical
df['day'] = df['day'].astype('category')
# set categories and ordered=True
df['day'] = df['day'].cat.set_categories(day_list, ordered=True)
# group by day, keeping categories that have no rows, then fill NaN with 0
res = df.groupby('day', observed=False).mean().fillna(0)
Result:
              value
day
Monday     4.666667
Tuesday    4.500000
Wednesday  0.000000
Thursday   0.000000
Friday     0.000000
Saturday   0.000000
Sunday     0.000000
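For the general case in the question (any column, any list of expected values), the pattern could be wrapped in a small function. This is only a sketch: mean_for_all_keys and its parameters are hypothetical names, not part of pandas, and it uses reindex so it does not depend on the column being categorical.

```python
import pandas as pd

def mean_for_all_keys(df, key, value, all_keys, fill_value=None):
    """Hypothetical helper: mean of `value` per `key`, one entry per item in all_keys."""
    # group on the keys that are present, then align to the full list;
    # keys with no rows come back as NaN
    means = df.groupby(key)[value].mean().reindex(all_keys)
    if fill_value is not None:
        # optionally replace the NaN entries, e.g. with 0
        means = means.fillna(fill_value)
    return means

df = pd.DataFrame([['Monday', 5], ['Monday', 6], ['Monday', 3],
                   ['Tuesday', 1], ['Tuesday', 8]],
                  columns=['day', 'value'])
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

res_nan = mean_for_all_keys(df, 'day', 'value', day_list)
res_zero = mean_for_all_keys(df, 'day', 'value', day_list, fill_value=0)
print(res_zero)
```

Leaving fill_value unset keeps NaN for the missing days, which may be preferable when a mean of 0 could be mistaken for real data.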