Impala fails when creating a partitioned table because of a corrupted Parquet file

Asked by: 小点点

I am using Dask to save partitioned Parquet files to an S3 bucket:

dd.to_parquet(
    dd.from_pandas(df, npartitions=1),
    path='s3a://test/parquet',
    engine='fastparquet',
    partition_on='country',
    object_encoding='utf8',
    compression="gzip",
    write_index=False,
)

The Parquet files are created successfully; here is the directory structure: [image: directory structure]
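Since the screenshot is not available here, the layout can be illustrated in code. With partition_on, the files land in Hive-style directories, one key=value level per partition column (the answer below shows a path of the form a=a/c=False/part.0.parquet). A minimal sketch, assuming no other path segment contains "="; the helper name parse_partition_path and the example path are hypothetical and not part of Dask, fastparquet, or Impala:

```python
def parse_partition_path(path):
    """Return the partition columns encoded in a Hive-style path as a dict."""
    partitions = {}
    for segment in path.split("/"):
        # Partition directories look like "column=value"; the other segments
        # (bucket, dataset root, file name) are assumed to contain no "=".
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


# Hypothetical path, following the layout described in the question.
example = "s3a://test/parquet/country=france/part.0.parquet"
print(parse_partition_path(example))  # {'country': 'france'}
```

The directory key is the partition column name, so this is a quick way to compare the names on disk with the column named in the table's partitioned by clause.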

I successfully created an Impala table from this Parquet dataset:

create external table tmp.countries_france
like parquet 's3a://test/parquet/_metadata'
partitioned by (country string)
stored as parquet location 's3a://test/parquet/'

and added a partition to the table:

alter table tmp.countries_france add partition (sheet='belgium')

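However, when I run select * from tmp.countries_france, I get the following error:

File 's3a://test/parquet/sheet=france/part.0.parquet' is corrupt: metadata indicates a zero row count but there is at least one non-empty row group.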

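I suspect the problem comes from Dask, because everything works fine when I write a non-partitioned Parquet file. I tried setting write_index=True, but that did not help.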

1 answer

Anonymous user

I cannot reproduce this:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import fastparquet

df = pd.DataFrame({'a': np.random.choice(['a', 'b', 'c'], size=1000),
                   'b': np.random.randint(0, 64000, size=1000),
                   'c': np.random.choice([True, False], size=1000)})
# writer.write(tempdir, df, partition_on=['a', 'c'], file_scheme=scheme)  # writer, tempdir and scheme are not defined in this snippet
df = dd.from_pandas(df, npartitions=1)
df.to_parquet('.', partition_on=['a', 'c'], engine='fastparquet')

pf = fastparquet.ParquetFile('_metadata')
pf.count  # 1000
len(pf.to_pandas())  # 1000
pf.row_groups[0].num_rows  # 171

pf = fastparquet.ParquetFile('a=a/c=False/part.0.parquet')
pf.count # 171
pf.row_groups[0].num_rows  # 171
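The comparison above can be wrapped in a small check that mirrors the wording of the Impala error: a file-level row count of zero combined with at least one non-empty row group. This is only a sketch of the condition as the error message describes it, not Impala's actual implementation; the function name footer_contradicts_row_groups is hypothetical.

```python
def footer_contradicts_row_groups(file_num_rows, row_group_num_rows):
    """True if the file-level count is zero while some row group has rows."""
    return file_num_rows == 0 and any(n > 0 for n in row_group_num_rows)


# Values taken from the output above: the partition file reports 171 rows
# in both the file-level count and its single row group, so it is consistent.
print(footer_contradicts_row_groups(171, [171]))  # False

# The situation described by the error message in the question.
print(footer_contradicts_row_groups(0, [171]))  # True
```

With fastparquet, the inputs would be pf.count and [rg.num_rows for rg in pf.row_groups], as used above; depending on the fastparquet version, count may need to be called as pf.count().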

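Obviously, I cannot say what Impala might be doing, but perhaps the "like" mechanism expects to find data in the _metadata file?

Note that pandas can write to and read from Parquet with the same options, without Dask.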
