How can I append to or update a Parquet file with pyarrow?
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})

# pq.write_table expects a pyarrow Table, so convert the DataFrame first
pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append to pqTest2 here?
I can't find anything in the documentation about appending to Parquet files. Also, can pyarrow be used with multiprocessing to insert/update data?
I ran into the same issue, and I think I was able to solve it using the following:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
chunksize = 10000  # this is the number of lines
pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)
    pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()
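In your case the column names are not consistent. I made the column names consistent across the three sample dataframes, and the following code works for me.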
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Write or append a DataFrame to a file in Parquet format.

    Converts a pandas DataFrame to a pyarrow Table and writes it in Parquet format. If the
    function is invoked with a writer, the DataFrame is appended to the file that writer
    already has open.

    :param dataframe: pd.DataFrame to be written in parquet format.
    :param filepath: target file location for parquet file.
    :param writer: ParquetWriter object to write pyarrow tables in parquet format.
    :return: ParquetWriter object. This can be passed in subsequent calls to append
        further DataFrames to the same file.
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer

if __name__ == '__main__':
    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]

    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)

    if writer:
        writer.close()

    df = pd.read_parquet(filepath)
    print(df)
Output:
   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
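Generally speaking, Parquet datasets consist of multiple files, so you can append by writing an additional file into the same directory where the data belongs. It would be useful to be able to concatenate multiple files easily. I opened https://issues.apache.org/jira/browse/PARQUET-1154 to make this easy to do in C++ (and therefore Python).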