How can I speed up importing many CSVs while filtering and adding the file name?

Asked by: 小点点

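I have some Python (3.8) code that does the following:

Walks the directories and subdirectories under a given path
Finds all .csv files
Keeps only the .csv files with "Pct" in the file name
Joins the path and the file name
Reads the CSV
Adds the file name to the df
Concatenates all the dfs

The code below works, but it takes a long time (15 minutes) to ingest all the CSVs; there are 52,000 files. That may not actually be very long, but I would like to reduce it as much as possible.

My current working code is: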

import fnmatch
import os

import pandas as pd

start_dirctory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'  # change this
df_result = None
#loop_number = 0

for path, dirs, files in os.walk(start_dirctory):
    for file in sorted(fnmatch.filter(files, '*.csv')):  # find .csv files
        # print(file)
        if 'Pct' in file:  # filter if contains 'Pct'
            # print('Pct = ', file)
            full_name = os.path.join(path, file)  # make full file path
            df_tmp = pd.read_csv(full_name, header=None)  # read file to df_tmp
            df_tmp['file'] = os.path.basename(file)  # df.file = file name
            if df_result is None:
                df_result = df_tmp
            else:
                df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)
            #print(full_name, 'imported')
            #loop_number = loop_number + 1
            #print('Loop number =', loop_number)

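Inspired by this post (finding files recursively) and this post (how to speed up importing CSVs), I tried to reduce the time it takes to ingest all the data, but I could not find a way to combine the filter for file names containing "Pct" with adding the file name to the df. That may not be possible with the code from those examples.

What I tried (incomplete):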

%%time

import glob
import pandas as pd

df = pd.concat(
    [pd.read_csv(f, header=None)
     for f in glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)
    ],
    axis='index', ignore_index=True
 )

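Question

Is there any way to reduce the time it takes to read and ingest the CSVs in the code above?

Thanks!

1 answer

Anonymous user

Please take a look at the following solution. It assumes the system's open-file limit is high enough: it streams the files one by one, but it has to open every file up front to read the header. If the files have different columns, you will get their superset in the resulting file: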

import os

from convtools import conversion as c
from convtools.contrib.tables import Table

files = sorted(
    os.path.join(path, file)
    for path, dirs, files in os.walk(start_dirctory)
    for file in files
    if "Pct" in file and file.endswith(".csv")
)

table = None
for file in files:
    table_ = Table.from_csv(file, header=True)  # assuming there's header
    if table is None:
        table = table_
    else:
        table.chain(table_)

# this will be an iterable of dicts, so consume with pandas or whatever
table.into_iter_rows(dict)  # or list, or tuple

# or just write the new file like:
# >>> table.into_csv("concatenated.csv")
# HOWEVER: into_* can only be used once, because Table
# cannot assume the incoming data stream can be read twice

If you are sure that all the files have the same columns (this opens one file at a time):

Edited to add the file column

def concat_files(files):
    for file in files:
        yield from Table.from_csv(file, header=True).update(
            file=file
        ).into_iter_rows(dict)

# this will be an iterable of dicts, so consume with pandas or whatever
concat_files(files)
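
To make the "consume with pandas" comment concrete, here is a minimal sketch. `fake_rows` is a hypothetical stand-in for `concat_files(files)` so that the snippet runs without any CSV files on disk; the column names and paths in it are made up for illustration.

```python
import pandas as pd


def fake_rows():
    # Hypothetical stand-in for concat_files(files): yields one dict per CSV row.
    yield {"a": "1", "b": "2", "file": "/data/x_Pct.csv"}
    yield {"a": "3", "b": "4", "file": "/data/y_Pct.csv"}


# pandas consumes the iterable once and builds the frame in a single step,
# rather than growing a DataFrame with pd.concat inside the loop.
df = pd.DataFrame(fake_rows())
```

Depending on the reader, the values may arrive as strings, in which case something like `pd.to_numeric` on the relevant columns would be needed afterwards.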

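Also, you can of course swap Table.from_csv for the standard reader or some other one, but the convtools one adapts to the files, so it is usually faster on large files.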
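
As a rough illustration of that swap, the same generator can be written with only the standard library's `csv.DictReader`. This is a sketch, not the convtools implementation: the helper names `find_pct_csvs` and `iter_rows_with_file` are invented here, and it assumes, as above, that every file has a header row.

```python
import csv
import os


def find_pct_csvs(root):
    """Return sorted paths of .csv files under root whose name contains 'Pct'."""
    return sorted(
        os.path.join(path, name)
        for path, _dirs, names in os.walk(root)
        for name in names
        if "Pct" in name and name.endswith(".csv")
    )


def iter_rows_with_file(paths):
    """Yield one dict per CSV row, with the source file name added."""
    for path in paths:
        # Only one file is open at any time.
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):  # assumes a header row
                row["file"] = os.path.basename(path)
                yield row
```

Calling `iter_rows_with_file(find_pct_csvs(start_dirctory))` gives an iterable of dicts that can be consumed the same way as the output of `concat_files`. For files without a header, `csv.reader` would be the closer fit, yielding lists instead of dicts.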
