Asked by: 小点点

What are the differences between Feather and Parquet?


Both are columnar (disk) storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer.
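
For example, both formats can be produced from the same in-memory Arrow table (a minimal sketch with made-up file names, assuming a recent pyarrow):

import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# one in-memory Arrow table, two on-disk formats
table = pa.Table.from_pandas(df)
feather.write_feather(table, 'data.feather')  # raw Arrow columns on disk
pq.write_table(table, 'data.parquet')         # encoded and compressed by default

# both read back as Arrow tables (and from there into pandas)
df_from_feather = feather.read_table('data.feather').to_pandas()
df_from_parquet = pq.read_table('data.parquet').to_pandas()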

How do the two formats differ?

Should you always prefer Feather over Parquet when working with pandas, whenever possible?

What are the use cases where Feather is more suitable than Parquet, and the other way around?

Appendix

I found some hints at https://github.com/wesm/feather/issues/188, but given the young age of this project, it is possibly a bit out of date.

This is not a serious speed test, since I am just dumping and loading a whole DataFrame, but it should give you some impression if you have never heard of the formats before:

# IPython
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import fastparquet as fp


df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})

print("pandas df to disk ####################################################")
print('example_feather:')
%timeit feather.write_feather(df, 'example_feather')
# 2.62 ms ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_parquet:')
%timeit pq.write_table(pa.Table.from_pandas(df), 'example.parquet')
# 3.19 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("for comparison:")
print('example_pickle:')
%timeit df.to_pickle('example_pickle')
# 2.75 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_fp_parquet:')
%timeit fp.write('example_fp_parquet', df)
# 7.06 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit df.to_hdf('example_hdf', 'key_to_store', mode='w', table=True)
# 24.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("pandas df from disk ##################################################")
print('example_feather:')
%timeit feather.read_feather('example_feather')
# 969 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_parquet:')
%timeit pq.read_table('example.parquet').to_pandas()
# 1.9 ms ± 5.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

print("for comparison:")
print('example_pickle:')
%timeit pd.read_pickle('example_pickle')
# 1.07 ms ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_fp_parquet:')
%timeit fp.ParquetFile('example_fp_parquet').to_pandas()
# 4.53 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit pd.read_hdf('example_hdf')
# 10 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas version: 0.22.0
# fastparquet version: 0.1.3
# numpy version: 1.13.3
# pyarrow version: 0.8.0
# sys.version: 3.6.3
# example Dataframe taken from https://arrow.apache.org/docs/python/parquet.html

2 Answers

Anonymous user


  • The Parquet format is designed for long-term storage, whereas Arrow is more intended for short-term or ephemeral storage (Arrow may become more suitable for long-term storage once the 1.0.0 release happens, since the binary format will be stable then).

  • Parquet is more expensive to write than Feather, as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

  • Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files.

  • Parquet is a standard storage format for analytics that is supported by many different systems: Spark, Hive, Impala, various AWS services, and in the future by BigQuery, etc. So if you are doing analytics, Parquet is a good choice as a reference storage format that can be queried by multiple systems.

  • The benchmarks you show are going to be very noisy, since the data you read and write is very small. You should try compressing at least 100 MB or upwards of 1 GB of data to get more informative benchmarks; see e.g. http://wesmckinney.com/blog/python-parquet-multithreading/ (a rough sketch along these lines follows this list).
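
A larger, less noisy comparison along those lines could look roughly like this (a sketch, not from the original answer; the 5M-row synthetic DataFrame and the file names are made up for illustration):

    import os
    import time

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # a larger DataFrame with a repetitive, low-cardinality column, where
    # Parquet's dictionary/RLE encoding and page compression pay off
    n = 5_000_000
    df = pd.DataFrame({
        'category': np.random.choice(['foo', 'bar', 'baz'], size=n),
        'value': np.random.randn(n),
    })
    table = pa.Table.from_pandas(df)

    for path, write in [
        # uncompressed Feather is essentially raw Arrow columns on disk
        ('big.feather', lambda: feather.write_feather(table, 'big.feather', compression='uncompressed')),
        # Parquet with its default encodings and compression
        ('big.parquet', lambda: pq.write_table(table, 'big.parquet')),
    ]:
        start = time.perf_counter()
        write()
        elapsed = time.perf_counter() - start
        print(f'{path}: {elapsed:.2f} s to write, {os.path.getsize(path) / 1e6:.1f} MB on disk')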

Anonymous user

    I would also include in the comparison between parquet and feather different compression methods, to check import/export speeds and how much storage each one uses.

    I advocate two options for the average user who wants a better csv alternative (a short usage sketch follows below):

    • parquet with "gzip" compression (for storage): exporting is marginally faster than plain .csv (and if the csv needs to be zipped, parquet is much faster). Importing is about 2x faster than csv. The file ends up at around 22% of the original file size, roughly the same as a zipped csv file.
    • feather with "zstd" compression (for I/O speed): compared to csv, feather exports about 20x faster and imports about 6x faster. The file ends up at around 32% of the original file size, about 10 percentage points worse than parquet "gzip" and zipped csv, but still decent.

    Both are better options than plain csv files in every category (I/O speed and storage).
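
    In pandas terms, the two options boil down to something like this (a minimal sketch; the file names are placeholders, and the compression keywords require pyarrow and a reasonably recent pandas):

    import pandas as pd

    df = pd.read_csv('my_data.csv')  # placeholder input file

    # option 1: parquet + gzip, best storage ratio
    df.to_parquet('my_data.parquet', compression='gzip')
    df = pd.read_parquet('my_data.parquet')

    # option 2: feather + zstd, best import/export speed
    df.to_feather('my_data.feather', compression='zstd')
    df = pd.read_feather('my_data.feather')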

    I analyzed the following formats:

    1. csv
    2. csv with "zip" compression
    3. feather with "zstd" compression
    4. feather with "lz4" compression
    5. parquet with "snappy" compression
    6. parquet with "gzip" compression
    7. parquet with "brotli" compression

    import zipfile
    import pandas as pd
    folder_path = (r"...\\intraday")
    zip_path = zipfile.ZipFile(folder_path + "\\AAPL.zip")    
    test_data = pd.read_csv(zip_path.open('AAPL.csv'))
    
    
    # EXPORT, STORAGE AND IMPORT TESTS
    # ------------------------------------------
    # - FORMAT .csv 
    
    # export
    %%timeit
    test_data.to_csv(folder_path + "\\AAPL.csv", index=False)
    # 12.8 s ± 399 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # storage
    # AAPL.csv exported using python.
    # 169.034 KB
    
    # import
    %%timeit
    test_data = pd.read_csv(folder_path + "\\AAPL.csv")
    # 1.56 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # ------------------------------------------
    # - FORMAT zipped .csv 
    
    # export
    %%timeit
    test_data.to_csv(folder_path + "\\AAPL.csv")
    # 12.8 s ± 399 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    # OBSERVATION: this does not include the time I spent manually zipping the .csv
    
    # storage
    # AAPL.csv zipped with .zip "normal" compression using 7-zip software.
    # 36.782 KB
    
    # import
    zip_path = zipfile.ZipFile(folder_path + "\\AAPL.zip")
    %%timeit
    test_data = pd.read_csv(zip_path.open('AAPL.csv'))
    # 2.31 s ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # ------------------------------------------
    # - FORMAT .feather using "zstd" compression.
    
    # export
    %%timeit
    test_data.to_feather(folder_path + "\\AAPL.feather", compression='zstd')
    # 460 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # storage
    # AAPL.feather exported with python using zstd
    # 54.924 KB
    
    # import
    %%timeit
    test_data = pd.read_feather(folder_path + "\\AAPL.feather")
    # 310 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # ------------------------------------------
    # - FORMAT .feather using "lz4" compression.
    # Only works installing with pip, not with conda. Bad sign.
    
    # export
    %%timeit
    test_data.to_feather(folder_path + "\\AAPL.feather", compression='lz4')
    # 392 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # storage
    # AAPL.feather exported with python using "lz4"
    # 79.668 KB    
    
    # import
    %%timeit
    test_data = pd.read_feather(folder_path + "\\AAPL.feather")
    # 255 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # ------------------------------------------
    # - FORMAT .parquet using compression "snappy"
    
    # export
    %%timeit
    test_data.to_parquet(folder_path + "\\AAPL.parquet", compression='snappy')
    # 2.82 s ± 47.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # storage
    # AAPL.parquet exported with python using "snappy"
    # 62.383 KB
    
    # import
    %%timeit
    test_data = pd.read_parquet(folder_path + "\\AAPL.parquet")
    # 701 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # ------------------------------------------
    # - FORMAT .parquet using compression "gzip"
    
    # export
    %%timeit
    test_data.to_parquet(folder_path + "\\AAPL.parquet", compression='gzip')
    # 10.8 s ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # storage
    # AAPL.parquet exported with python using "gzip"
    # 37.595 KB
    
    # import
    %%timeit
    test_data = pd.read_parquet(folder_path + "\\AAPL.parquet")
    # 1.18 s ± 80.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # ------------------------------------------
    # - FORMAT .parquet using compression "brotli"
    
    # export
    %%timeit
    test_data.to_parquet(folder_path + "\\AAPL.parquet", compression='brotli')
    # around 5min each loop. I did not run %%timeit on this one.
    
    # storage
    # AAPL.parquet exported with python using "brotli"
    # 29.425 KB    
    
    # import
    %%timeit
    test_data = pd.read_parquet(folder_path + "\\AAPL.parquet")
    # 1.04 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Observations:

    • Feather seems better for lightweight data, as it writes and loads faster. Parquet has better storage ratios.
    • The feather library's support and maintenance worried me at first; however, the file format integrates well with pandas, and I could install the dependency for the "zstd" compression method using conda.
    • The best storage ratio by far is parquet with "brotli" compression, but it takes a long time to export. Its import speed is fine once the export is done, though importing is still about 2.5x slower than feather.