Asked by: 小点点

Converting a TSV file to a pandas DataFrame


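OK, so I have been stuck on a problem since last week: I have a TSV file that I would like to modify and convert into a pandas DataFrame.

My file looks like this: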

{'NC_011745.1_islands.csv': [['PAI 1 EaaA, EibA : 3.1'],
                             ['PAI 2 EaaA : 7.75'],
                             ['PAI 3 Capsule : 4.428571428571429'],
                             ['PAI 4 EaaA : 7.75'],
                             ['PAI 5 ipaH : 7.75'],
                             ['PAI 6 IreA, IrgA homolog adhesin (Iha) : '
                              '0.96875'],
                             ['PAI 7 IrgA homolog adhesin (Iha), Aerobactin : '
                              '0.8157894736842105'],
                             ['PAI 8 MsbB2, VirK : 2.8181818181818183'],
                             ['PAI 9 Antigen 43, AIDA-I type : '
                              '1.3478260869565217']],
 'NC_017632_islands.csv': [['PAI 1 Capsule : 15.857142857142858'],
                           ['PAI 2 AAI/SCI-II, direct heme uptake system, '
                            'Colibactin, Colibactin : 1.819672131147541'],
                           ['PAI 3 F9-like fimbriae, Type 1 fimbriae : '
                            '3.3636363636363638'],
                           ['PAI 4 Ferrous iron transport : 5.045454545454546'],
                           ['PAI 5 Cah, AIDA-I type, Salmochelin, S fimbriae : '
                            '2.707317073170732'],
                           ['PAI 6 ECP, Tsh : 13.875'],
                           ['PAI 7 ACE/AEC T6SS : 9.25'],
                           ['PAI 8 Tia/Hek, P fimbriae, F17-like fimbriae, '
                            'AAI/SCI-II, CNF-1, Alpha-hemolysin, '
                            'hemagglutinin-like adhesin : 1.088235294117647']],
 'NC_017646_islands.csv': [['PAI 1 Allantion utilization : 5.285714285714286'],
                           ['PAI 2 direct heme uptake system : 4.44'],
                           ['PAI 3 ipaH : 27.75'],
                           ['PAI 4 P fimbriae, Aerobactin, Sat, IrgA homolog '
                            'adhesin (Iha), K1 capsule, K1 capsule, T2SS : '
                            '1.3058823529411765'],
                           ['PAI 5 P fimbriae, Tia/Hek : 5.842105263157895'],
                           ['PAI 6 VirK, MsbB2 : 10.090909090909092']]}

I would like to modify it and export it as a pandas DataFrame that looks like this:

\             EaaA, EibA   EaaA   Capsule    ipaH    IreA, IrgA homolog adhesin (Iha)  ...
NC_011745.1     3.1        7.75    4.4285..  7.75                0.96875
NC_017632        NA         NA     15.8574   NA                  NA

The main problem for me is getting it into a DataFrame. I tried:

df = pd.DataFrame([dict]).T
df.to_tsv()

But it says that this function does not work with tsv, only with csv.

Thank you for your help, and by the way, sorry for my English :)


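1 answer

Anonymous user

You can't do this with pandas out of the box. pandas is good, but it isn't magic. You will need to do a fair amount of manipulation before the data is ready to produce a DataFrame in the format you want. Try something like this: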

_dict={'NC_011745.1_islands.csv': [['PAI 1 EaaA, EibA : 3.1'],
                             ['PAI 2 EaaA : 7.75'],
                             ['PAI 3 Capsule : 4.428571428571429'],
                             ['PAI 4 EaaA : 7.75'],
                             ['PAI 5 ipaH : 7.75'],
                             ['PAI 6 IreA, IrgA homolog adhesin (Iha) : '
                              '0.96875'],
                             ['PAI 7 IrgA homolog adhesin (Iha), Aerobactin : '
                              '0.8157894736842105'],
                             ['PAI 8 MsbB2, VirK : 2.8181818181818183'],
                             ['PAI 9 Antigen 43, AIDA-I type : '
                              '1.3478260869565217']],
 'NC_017632_islands.csv': [['PAI 1 Capsule : 15.857142857142858'],
                           ['PAI 2 AAI/SCI-II, direct heme uptake system, '
                            'Colibactin, Colibactin : 1.819672131147541'],
                           ['PAI 3 F9-like fimbriae, Type 1 fimbriae : '
                            '3.3636363636363638'],
                           ['PAI 4 Ferrous iron transport : 5.045454545454546'],
                           ['PAI 5 Cah, AIDA-I type, Salmochelin, S fimbriae : '
                            '2.707317073170732'],
                           ['PAI 6 ECP, Tsh : 13.875'],
                           ['PAI 7 ACE/AEC T6SS : 9.25'],
                           ['PAI 8 Tia/Hek, P fimbriae, F17-like fimbriae, '
                            'AAI/SCI-II, CNF-1, Alpha-hemolysin, '
                            'hemagglutinin-like adhesin : 1.088235294117647']],
 'NC_017646_islands.csv': [['PAI 1 Allantion utilization : 5.285714285714286'],
                           ['PAI 2 direct heme uptake system : 4.44'],
                           ['PAI 3 ipaH : 27.75'],
                           ['PAI 4 P fimbriae, Aerobactin, Sat, IrgA homolog '
                            'adhesin (Iha), K1 capsule, K1 capsule, T2SS : '
                            '1.3058823529411765'],
                           ['PAI 5 P fimbriae, Tia/Hek : 5.842105263157895'],
                           ['PAI 6 VirK, MsbB2 : 10.090909090909092']]}


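import pandas as pd

rows = {}
for filename, entries in _dict.items():
    row = {}
    for entry in entries:
        for text in entry:
            # Split "PAI 1 EaaA, EibA : 3.1" into the label part and the score.
            label, value = text.split(" : ")
            # Drop the "PAI " prefix and the island number that follows it
            # (this slice assumes a single-digit number, as in the sample data).
            label = label.replace("PAI ", "")[2:]
            # A label that repeats within one file overwrites the earlier value.
            row[label] = float(value)
    rows[filename] = row

# One row per file, one column per label; labels missing from a file become NaN.
df = pd.DataFrame.from_dict(rows, 'index')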
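
On the export error from the question: as far as I know, a DataFrame has no to_tsv method, which is why that call fails. The usual route is to_csv with a tab separator. Here is a minimal sketch; the tiny frame is a hypothetical stand-in for the real result, and I write to an in-memory buffer only so the example runs on its own (pass a file path instead to write a file).

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the real result.
df = pd.DataFrame(
    {"EaaA": [7.75, None], "Capsule": [4.428571428571429, 15.857142857142858]},
    index=["NC_011745.1", "NC_017632"],
)

# to_csv with sep="\t" produces tab-separated output;
# na_rep controls how missing values are written.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", na_rep="NA")
tsv_text = buffer.getvalue()
print(tsv_text)
```

Setting na_rep="NA" is only there to match the NA cells in the table the question asks for; leave it out to get empty fields.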

You will need to find a robust way to parse the strings, possibly with regex, but this should get you started.
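
For what it's worth, here is a rough sketch of what a regex-based parser might look like. The names PAI_PATTERN, parse_entry and build_frame are my own, not anything from pandas, and the pattern assumes every string has the shape "PAI <number> <label> : <plain decimal value>". Unlike the fixed slice above, it copes with island numbers of more than one digit, and it strips the "_islands.csv" suffix so the row labels match the table in the question.

```python
import re

import pandas as pd

# Assumed shape: "PAI", an island number, a label, " : ", a plain decimal value.
PAI_PATTERN = re.compile(r"^PAI\s+(\d+)\s+(.+?)\s*:\s*([0-9.]+)\s*$")


def parse_entry(text):
    """Return (label, value) for one 'PAI n label : value' string."""
    match = PAI_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unexpected entry format: {text!r}")
    _number, label, value = match.groups()
    return label, float(value)


def build_frame(data):
    """Build a DataFrame with one row per file and one column per label."""
    rows = {}
    for filename, entries in data.items():
        row = {}
        for entry in entries:
            for text in entry:
                label, value = parse_entry(text)
                # A repeated label within one file overwrites the earlier value.
                row[label] = value
        rows[filename.replace("_islands.csv", "")] = row
    return pd.DataFrame.from_dict(rows, orient="index")


# Small hypothetical subset of the data from the question.
sample = {
    "NC_011745.1_islands.csv": [["PAI 1 EaaA, EibA : 3.1"],
                                ["PAI 3 Capsule : 4.428571428571429"]],
    "NC_017632_islands.csv": [["PAI 1 Capsule : 15.857142857142858"]],
}
df = build_frame(sample)
print(df)
```

Raising on a string that does not match is a deliberate choice here: it surfaces malformed lines instead of silently producing odd column names. If your values can be in scientific notation or negative, the value part of the pattern would need to be widened.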