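I don't quite understand how some grouping and summarizing operations are built up with the dplyr package in R.
Below is a reproducible example in which I first group by (PN, GOT, HID) to count the distinct values of PC1. I then regroup by (PN, GOT) and sum those distinct counts over the second grouping. This works for the sum but not for the mean of TC: where I expect the mean within each (PN, GOT) group, I get the mean over the whole data frame. What am I missing to get the (PN, GOT) mean without losing the PC1 totals I built? I'd appreciate an explanation of where I'm going wrong.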
PN<- c("Mazda","Mazda","Datsun","Hornet","Hornet","Valiant","Duster","Merc","Merc","Merc","Merc","Merc",
"Merc","Merc","Fiat","Honda","Toyota","Toyota","Dodge","AMC","Fiat")
GOT<- c("A","A","B","C","C","A","D","B","B","B","B","B","B","B","A","D","B","B","C","E","A")
HID<- c("Mazda_H1","Mazda_H1","Datsus_H1","Hornet_H1","Hornet_H2","Valiant_H1","Duster_H1","Merc_H1","Merc_H1","Merc_H1",
"Merc_H2","Merc_H2","Merc_H3","Merc_H4","Fiat_H1","Honda_H1","Toyota_H1","Toyota_H2","Dodge_H1","AMC_H1","Fiat_H1")
PIC<- c("BB","BB","BB","BB","AA","AA","AA","BA","BA","BA",
"AA","BB","BB","BB","BB","AA","AA","AA","BA","BA","BA")
TC <- c(110,110,93,175,175,105,245,62,62,62,62,62,62,62,33,52,97,97,150,150,33)
Int <- c(16.46,17.02,18.61,19.44,17.02,20.22,15.84,20.00,22.90,18.30,18.90,
17.40,17.60,18.00,19.47,18.52,19.90,20.01,16.87,17.30,18.90)
PC1<- c("", "","G1","C1","","G1","", "G1","G1","C1","C1","","","","Z1","Z1","Z1","Z1","","","G1")
df<-data.frame(PN,GOT,HID,PIC,TC,Int,PC1)
df
library(dplyr)
df %>% filter(PC1 != "") %>%
group_by(PN, GOT, HID) %>%
summarize(new = n_distinct(PC1)) %>%
group_by(PN, GOT) %>%
mutate(TOT_new = sum(new),
meanTC = mean(TC))
I think the answer I'm looking for is something like this:
PN GOT HID TOT_new meanTC
<fctr> <fctr> <fctr> <int> <dbl>
1 Datsun B Datsus_H1 1 93
2 Fiat A Fiat_H1 2 33
3 Honda D Honda_H1 1 52
4 Hornet C Hornet_H1 1 175
5 Merc B Merc_H1 3 62
6 Toyota B Toyota_H1 2 97
7 Valiant A Valiant_H1 1 105
Or at least this:
PN GOT HID new TOT_new meanTC
<fctr> <fctr> <fctr> <int> <int> <dbl>
1 Datsun B Datsus_H1 1 1 93
2 Fiat A Fiat_H1 2 2 33
3 Honda D Honda_H1 1 1 52
4 Hornet C Hornet_H1 1 1 175
5 Merc B Merc_H1 2 3 62
6 Merc B Merc_H2 1 3 62
7 Toyota B Toyota_H1 1 2 97
8 Toyota B Toyota_H2 1 2 97
9 Valiant A Valiant_H1 1 1 105
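As @r2evans commented, you get the global mean because the TC column is dropped at the first summarize stage. After that step there is no TC column left in the data, so mean(TC) in the later mutate() falls back to the TC vector in your workspace, which holds all 21 values. Besides the join option suggested in the comments, you can also carry the TC information forward through the first summarize stage by computing two intermediate variables: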
df %>% filter(PC1 != "") %>%
group_by(PN, GOT, HID) %>%
# create two columns with the sum and length of TC in each group which you can use later
# for average calculation
summarize(new = n_distinct(PC1), n = n(), TC_sum = sum(TC)) %>%
group_by(PN, GOT) %>%
summarise(TOT_new = sum(new), meanTC = sum(TC_sum)/sum(n))
# Source: local data frame [7 x 4]
# Groups: PN [?]
# PN GOT TOT_new meanTC
# <fctr> <fctr> <int> <dbl>
#1 Datsun B 1 93
#2 Fiat A 2 33
#3 Honda D 1 52
#4 Hornet C 1 175
#5 Merc B 3 62
#6 Toyota B 2 97
#7 Valiant A 1 105
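The join option mentioned above is not spelled out in this answer. A minimal sketch, assuming it means computing the mean of TC per (PN, GOT) separately and joining it back onto the per-HID counts; the names `kept`, `counts`, `means` and `result` are made up for illustration:

```r
library(dplyr)

# Same data as in the question, limited to the columns used here
df <- data.frame(
  PN  = c("Mazda","Mazda","Datsun","Hornet","Hornet","Valiant","Duster","Merc","Merc","Merc","Merc","Merc",
          "Merc","Merc","Fiat","Honda","Toyota","Toyota","Dodge","AMC","Fiat"),
  GOT = c("A","A","B","C","C","A","D","B","B","B","B","B","B","B","A","D","B","B","C","E","A"),
  HID = c("Mazda_H1","Mazda_H1","Datsus_H1","Hornet_H1","Hornet_H2","Valiant_H1","Duster_H1","Merc_H1","Merc_H1","Merc_H1",
          "Merc_H2","Merc_H2","Merc_H3","Merc_H4","Fiat_H1","Honda_H1","Toyota_H1","Toyota_H2","Dodge_H1","AMC_H1","Fiat_H1"),
  TC  = c(110,110,93,175,175,105,245,62,62,62,62,62,62,62,33,52,97,97,150,150,33),
  PC1 = c("", "","G1","C1","","G1","", "G1","G1","C1","C1","","","","Z1","Z1","Z1","Z1","","","G1")
)

kept <- df %>% filter(PC1 != "")

# Stage 1: distinct PC1 values per (PN, GOT, HID); TC is not carried along
counts <- kept %>%
  group_by(PN, GOT, HID) %>%
  summarise(new = n_distinct(PC1)) %>%
  ungroup()

"TC" %in% names(counts)   # FALSE: the column is gone after summarise()

# Mean of TC per (PN, GOT), taken from the filtered rows
means <- kept %>%
  group_by(PN, GOT) %>%
  summarise(meanTC = mean(TC)) %>%
  ungroup()

# Add the per-(PN, GOT) total, then join the means back on
result <- counts %>%
  group_by(PN, GOT) %>%
  mutate(TOT_new = sum(new)) %>%
  ungroup() %>%
  left_join(means, by = c("PN", "GOT"))
```

This should give the nine-row form from the question, with one row per HID. The sum/n version above avoids the second pass over the data, while the join keeps each step a plain mean() or sum().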
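We can also use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), specify the logical condition in 'i' (PC1 != ""), and group by 'PN', 'GOT', 'HID' to get the number of unique elements of 'PC1' ('new'), the number of rows per group (.N, stored as 'n'), and the sum of 'TC' ('TC_sum'). Then, grouped by 'PN' and 'GOT', assign (:=) the sum of 'new' as 'TOT_new' and the ratio of the sum of 'TC_sum' to the sum of 'n' as 'meanTC'. Finally, assign the columns that are no longer needed to NULL.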
library(data.table)
setDT(df)[PC1 != "", .(new = uniqueN(PC1), n = .N, TC_sum = sum(TC)) ,.(PN, GOT, HID)
][, c("TOT_new", "meanTC") := .(sum(new), sum(TC_sum)/sum(n)) ,.(PN, GOT)
][, c("n", "TC_sum") := NULL][]
# PN GOT HID new TOT_new meanTC
#1: Datsun B Datsus_H1 1 1 93
#2: Hornet C Hornet_H1 1 1 175
#3: Valiant A Valiant_H1 1 1 105
#4: Merc B Merc_H1 2 3 62
#5: Merc B Merc_H2 1 3 62
#6: Fiat A Fiat_H1 2 2 33
#7: Honda D Honda_H1 1 1 52
#8: Toyota B Toyota_H1 1 2 97
#9: Toyota B Toyota_H2 1 2 97
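The data.table code above returns the nine-row form, with one row per HID. If the seven-row form from the question is wanted instead, a possible sketch is to aggregate in the second step rather than assign with :=. The names `dt`, `stage1` and `summary7` are made up for illustration:

```r
library(data.table)

# Same data as in the question, limited to the columns used here
dt <- data.table(
  PN  = c("Mazda","Mazda","Datsun","Hornet","Hornet","Valiant","Duster","Merc","Merc","Merc","Merc","Merc",
          "Merc","Merc","Fiat","Honda","Toyota","Toyota","Dodge","AMC","Fiat"),
  GOT = c("A","A","B","C","C","A","D","B","B","B","B","B","B","B","A","D","B","B","C","E","A"),
  HID = c("Mazda_H1","Mazda_H1","Datsus_H1","Hornet_H1","Hornet_H2","Valiant_H1","Duster_H1","Merc_H1","Merc_H1","Merc_H1",
          "Merc_H2","Merc_H2","Merc_H3","Merc_H4","Fiat_H1","Honda_H1","Toyota_H1","Toyota_H2","Dodge_H1","AMC_H1","Fiat_H1"),
  TC  = c(110,110,93,175,175,105,245,62,62,62,62,62,62,62,33,52,97,97,150,150,33),
  PC1 = c("", "","G1","C1","","G1","", "G1","G1","C1","C1","","","","Z1","Z1","Z1","Z1","","","G1")
)

# Stage 1: per (PN, GOT, HID), count distinct PC1 and keep the pieces needed for the mean
stage1 <- dt[PC1 != "",
             .(new = uniqueN(PC1), n = .N, TC_sum = sum(TC)),
             by = .(PN, GOT, HID)]

# Stage 2: collapse to one row per (PN, GOT) instead of adding columns by reference
summary7 <- stage1[, .(TOT_new = sum(new), meanTC = sum(TC_sum) / sum(n)),
                   by = .(PN, GOT)]
```

Because the second step returns a new table rather than modifying `stage1`, there are no helper columns to remove afterwards.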
Based on some of the comments above, this is one way to make it work. But it looks redundant.
df%>% filter(PC1!="") %>%
group_by(PN, GOT, HID) %>%
summarize(new = n_distinct(PC1),
meanTC = mean(TC)) %>%
group_by(PN, GOT) %>%
mutate(TOT_new = sum(new),
meanTC = mean(meanTC))%>%
select(-HID)
PN GOT new meanTC TOT_new
<fctr> <fctr> <int> <dbl> <int>
1 Datsun B 1 93 1
2 Fiat A 2 33 2
3 Honda D 1 52 1
4 Hornet C 1 175 1
5 Merc B 2 62 3
6 Merc B 1 62 3
7 Toyota B 1 97 2
8 Toyota B 1 97 2
9 Valiant A 1 105 1
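One thing to keep in mind with the code above: it averages the per-HID means, which matches the true (PN, GOT) mean here because TC is constant within each PN in this data. With unequal HID group sizes and varying TC, the two can differ. A small sketch with made-up numbers (the `toy` data is hypothetical, not from the question):

```r
library(dplyr)

# Hypothetical toy data: one PN with two HIDs of unequal size
toy <- data.frame(
  PN  = c("X", "X", "X"),
  HID = c("X_H1", "X_H1", "X_H2"),
  TC  = c(10, 20, 60)
)

per_hid <- toy %>%
  group_by(PN, HID) %>%
  summarise(n = n(), TC_sum = sum(TC), meanTC = mean(TC)) %>%
  ungroup()

# Unweighted mean of the per-HID means: (15 + 60) / 2 = 37.5
mean_of_means <- mean(per_hid$meanTC)

# Weighted version, as in the sum/n answer: (10 + 20 + 60) / 3 = 30
weighted_mean <- sum(per_hid$TC_sum) / sum(per_hid$n)
```

Only `weighted_mean` equals `mean(toy$TC)`, which is why carrying the sum and the row count forward is the safer general pattern.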