Asked by: 小点点

Merging duplicate rows in a large data frame [duplicate]


Basically, I have a data frame, df:

         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F
Pathway8    A         G           NA           NA           E
Pathway9    A         G           Z            H            F
Pathway6    A         G           Z            H            E
Pathway2    A         G           D            NA           F
Pathway5    A         G           D            NA           E
Pathway1    A         D           K            NA           F
Pathway7    A         B           C            D            F
Pathway4    A         B           C            D            E

Now I would like to merge these rows, as follows:

newdf
      Beginning1    Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F, E
Pathway9    A         G           Z            H            F, E
Pathway2    A         G           D            NA           F, E
Pathway1    A         D           K            NA           F
Pathway4    A         B           C            D            F, E

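This is a follow-up to a question I asked earlier (Merging duplicate rows in a data frame). The approach from that question works for this data set, but on my larger data set it does not seem to combine the values. For example, here are the first few rows of the output (after I adapted the code given by @Matt Jewett, or used the explanation provided in "Concatenate string by group with dplyr"):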

          Beginning1    Protein2    Protein3    Protein4    Biomarker1
Pathway1    Smoothened    Gl-1                              Osteopontin
Pathway2    Smoothened    Gl-1      BMP2                    Osteopontin
Pathway3    Smoothened    Gl-1      BMP2                    DLX5
Pathway4    Smoothened    Gl-1      BMP2                    Osteopontin

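As you can see, there are two problems. First, the Biomarker1 column does not appear to be aggregated. Second, several rows are duplicated. I have hit a wall trying to fix this, so any solution you can think of would be greatly appreciated!

Thank you very much for your help!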


1 answer

Anonymous user

This is simple enough with data.table:

library(data.table)

dat <- fread("Pathway Beginning1 Protein2    Protein3    Protein4    Biomarker1
             Pathway3    A         G           NA           NA           F
             Pathway8    A         G           NA           NA           E
             Pathway9    A         G           Z            H            F
             Pathway6    A         G           Z            H            E
             Pathway2    A         G           D            NA           F
             Pathway5    A         G           D            NA           E
             Pathway1    A         D           K            NA           F
             Pathway7    A         B           C            D            F
             Pathway4    A         B           C            D            E")

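# Group rows that share the same protein chain; keep the first Pathway label
# of each group and join that group's biomarkers into one string.
dat_collapse <- dat[, .(Pathway = Pathway[1],
                        Biomarker1 = paste0(Biomarker1, collapse = ", ")),
                    by = .(Beginning1, Protein2, Protein3, Protein4)]

# Restore the original column order (grouping columns come first by default).
setcolorder(dat_collapse, names(dat))
dat_collapse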

Which results in:

    Pathway Beginning1 Protein2 Protein3 Protein4 Biomarker1
1: Pathway3          A        G       NA       NA       F, E
2: Pathway9          A        G        Z        H       F, E
3: Pathway2          A        G        D       NA       F, E
4: Pathway1          A        D        K       NA          F
5: Pathway7          A        B        C        D       F, E
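
On the larger data set, one possible reason the rows fail to merge is that the grouping columns are not truly identical: the output in the question shows blank cells rather than NA, and stray whitespace would also split groups. That is a guess, since the full data is not shown. Below is a minimal sketch under that assumption; the table `big`, its values, and the helper variable `key_cols` are hypothetical and only mimic the problem. It trims the key columns, turns empty strings into NA, and uses `unique()` so each biomarker is listed once per group.

```r
library(data.table)

# Hypothetical sample mimicking the larger data set: blanks instead of NA,
# stray whitespace, and repeated biomarker values.
big <- data.table(
  Pathway    = paste0("Pathway", 1:5),
  Beginning1 = c("Smoothened", "Smoothened", "Smoothened ", "Smoothened", "Smoothened"),
  Protein2   = rep("Gl-1", 5),
  Protein3   = c("", "BMP2", "BMP2", "BMP2", NA),
  Protein4   = c("", "", " ", "", ""),
  Biomarker1 = c("Osteopontin", "Osteopontin", "DLX5", "Osteopontin", "DLX5")
)

key_cols <- c("Beginning1", "Protein2", "Protein3", "Protein4")

# Trim whitespace and turn empty strings into NA so equal keys compare equal.
clean <- copy(big)
clean[, (key_cols) := lapply(.SD, function(x) {
  x <- trimws(as.character(x))
  x[!is.na(x) & x == ""] <- NA_character_
  x
}), .SDcols = key_cols]

# Collapse each group, listing every distinct biomarker once.
big_collapse <- clean[, .(Pathway = Pathway[1],
                          Biomarker1 = paste(unique(Biomarker1), collapse = ", ")),
                      by = key_cols]

setcolorder(big_collapse, names(big))
big_collapse
```

With this sample, the five input rows collapse to two, one per distinct protein chain. If your real data still shows duplicates after this step, comparing `unique(big$Protein3)` before and after cleaning may reveal other differences, such as inconsistent capitalization.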