Repartition a pyarrow table by size with pyarrow and write it to multiple parquet files?

Asked by: 小点点

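As the title says, I would like to repartition a pyarrow table by size (or by row group size) using pyarrow, and write it out as several parquet files.

I looked through the pyarrow documentation and found the chapter on partitioned datasets, which seemed like the right direction. Unfortunately, it shows that partitioning by column content is possible, but not partitioning by size (or by row group size).

So, starting from a table, how can I control the write step so that several files are written, each with a controlled size of x MB (or a controlled row group size)?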

import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

file = 'example.parquet'
file_res = 'example_res'

# Generate a random df
df = pd.DataFrame(np.random.randint(100,size=(100000, 20)),columns=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T'])
table = pa.Table.from_pandas(df)

# With this command, I can write a single parquet file that contains 2 row groups.
pq.write_table(table, file, version='2.0', row_group_size=50000)

# I can read it back and try to write it as a partitioned dataset, but a single parquet file is then written.
table_new = pq.ParquetFile(file).read()
pq.write_to_dataset(table_new, file_res)

Thanks for your help! Best regards,

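1 answer

Anonymous user

Looking at the documentation for write_to_dataset and ParquetWriter, I can't see any obvious option for this.

However, you can assign each row to a bucket and then partition the data by bucket, for example: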

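# Assign each row to a bucket of 5,000 consecutive rows (20 buckets for 100,000 rows).
df = (
    pd.DataFrame(np.random.randint(100,size=(100000, 20)),columns=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T'])
    .assign(bucket=lambda x: x.index // 5000)
)
table = pa.Table.from_pandas(df)
# Write the table that contains the bucket column (not table_new from the question).
pq.write_to_dataset(table, file_res, partition_cols=['bucket'])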

You will get the following directory structure, with one folder per bucket:

bucket=0
bucket=1
bucket=10
bucket=11
bucket=12
bucket=13
bucket=14
bucket=15
bucket=16
bucket=17
bucket=18
bucket=19
bucket=2
bucket=3
bucket=4
bucket=5
bucket=6
bucket=7
bucket=8
bucket=9

This assumes that your df.index starts at zero and increases by one (0, 1, 2, 3, ...).
