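I am trying to load Parquet data into Google BigQuery, to take advantage of the efficient columnar format and also (I hope) to get around BigQuery's lack of support for logical types (DATE etc.) in AVRO files.
My data contains two levels of nested arrays.
Using JSON I can create and load the table with the desired structure: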
bq mk temp.simple_interval simple_interval_bigquery_schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON temp.simple_interval ~/Desktop/simple_interval.json
bq show temp.simple_interval
   Last modified                   Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- ---------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:21:56   |- file_name: string (required)          3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (repeated)
                    |  |- interval_date: date (required)
                    |  |- quality: string (required)
                    |  +- values: record (repeated)
                    |  |  |- interval: integer (required)
                    |  |  |- value: float (required)
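The contents of simple_interval_bigquery_schema.json are not shown here. As a rough sketch, a schema file consistent with the bq show output could be generated in Python as follows. The field list is reconstructed from that output rather than copied from the original file, and the helper name `field` is only for illustration.

```python
import json


def field(name, type_, mode="REQUIRED", fields=None):
    """Build one BigQuery schema entry; `fields` is only set for RECORD types."""
    entry = {"name": name, "type": type_, "mode": mode}
    if fields is not None:
        entry["fields"] = fields
    return entry


# Reconstructed from the `bq show` output; an assumption, not the original file.
schema = [
    field("file_name", "STRING"),
    field("file_created", "TIMESTAMP"),
    field("id", "STRING"),
    field("interval_length", "INTEGER"),
    field("days", "RECORD", mode="REPEATED", fields=[
        field("interval_date", "DATE"),
        field("quality", "STRING"),
        field("values", "RECORD", mode="REPEATED", fields=[
            field("interval", "INTEGER"),
            field("value", "FLOAT"),
        ]),
    ]),
]

# Serialize in the JSON layout that a BigQuery schema file uses.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

The two REPEATED records are what produce the `record (repeated)` lines for days and values in the output.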
I am trying to create the same structure from a Parquet data file written with AvroParquetWriter. My AVRO schema is:
{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {
      "type": "array",
      "items": {
        "name": "days_record",
        "type": "record",
        "fields": [
          {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
          {"name": "quality", "type": "string"},
          {"name": "values", "type": {
            "type": "array",
            "items": {
              "name": "values_record",
              "type": "record",
              "fields": [
                {"name": "interval", "type": "int"},
                {"name": "value", "type": "float"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
From the AVRO specification and what I have found online, it seems that the "record" nodes have to be nested inside "array" nodes in this way.
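To make that nesting explicit, here is a small Python sketch that walks an Avro schema held as a dict and lists the dotted paths of all array-typed fields. The helper name `array_paths` is hypothetical, and the schema is an abbreviated copy of mine with most scalar fields left out.

```python
def array_paths(avro_type, prefix=""):
    """Return the dotted paths of every field whose Avro type is an array."""
    paths = []
    if isinstance(avro_type, dict):
        if avro_type.get("type") == "array":
            paths.append(prefix)
            # The record lives under "items", so keep walking from there.
            paths += array_paths(avro_type["items"], prefix)
        elif avro_type.get("type") == "record":
            for f in avro_type["fields"]:
                child = f"{prefix}.{f['name']}" if prefix else f["name"]
                paths += array_paths(f["type"], child)
    # Primitive names ("int") and logical-type dicts contain no arrays.
    return paths


# Abbreviated version of the schema: two array levels, each wrapping a record.
schema = {
    "name": "simple_interval",
    "type": "record",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "days", "type": {"type": "array", "items": {
            "name": "days_record",
            "type": "record",
            "fields": [
                {"name": "interval_date",
                 "type": {"type": "int", "logicalType": "date"}},
                {"name": "values", "type": {"type": "array", "items": {
                    "name": "values_record",
                    "type": "record",
                    "fields": [
                        {"name": "interval", "type": "int"},
                        {"name": "value", "type": "float"},
                    ],
                }}},
            ],
        }}},
    ],
}

print(array_paths(schema))  # ['days', 'days.values']
```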
When I create the Parquet file, Parquet tools reports the schema as:
message simple_interval {
  required binary file_name (UTF8);
  required int64 file_created (TIMESTAMP_MILLIS);
  required binary id (UTF8);
  required int32 interval_length;
  required group days (LIST) {
    repeated group array {
      required int32 interval_date (DATE);
      required binary quality (UTF8);
      required group values (LIST) {
        repeated group array {
          required int32 interval;
          required float value;
        }
      }
    }
  }
}
I load the file into BigQuery and examine the results:
bq load --source_format=PARQUET temp.simple_interval ~/Desktop/simple_interval.parquet
bq show temp.simple_interval
   Last modified                      Schema                      Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- --------------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:05:54   |- file_name: string (required)               3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (required)
                    |  +- array: record (repeated)           <-- extra column
                    |  |  |- interval_date: date (required)
                    |  |  |- quality: string (required)
                    |  |  +- values: record (required)
                    |  |  |  +- array: record (repeated)     <-- extra column
                    |  |  |  |  |- interval: integer (required)
                    |  |  |  |  |- value: float (required)
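This is workable, but I would like to know: is there a way to avoid the extra "array" intermediate nodes/columns?
Am I missing something? Is there a way with AVRO/Parquet to get the simpler BigQuery table structure for nested arrays, as with JSON?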
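To pin down what "simpler" means here, the following Python sketch rewrites a BigQuery schema description (a list of dicts) so that each record wrapping a single REPEATED `array` child collapses into one REPEATED field. It only transforms the schema description, not the loaded table or the Parquet file. The function name `collapse_array_wrappers` is made up for illustration, and the input is an abbreviated version of the schema BigQuery reported after the Parquet load.

```python
def collapse_array_wrappers(fields):
    """Turn `x: RECORD { array: T REPEATED }` into `x: T REPEATED`, recursively."""
    result = []
    for f in fields:
        f = dict(f)  # shallow copy so the input list is left untouched
        children = f.get("fields", [])
        is_wrapper = (
            f["type"] == "RECORD"
            and len(children) == 1
            and children[0]["name"] == "array"
            and children[0].get("mode") == "REPEATED"
        )
        if is_wrapper:
            inner = children[0]
            f["mode"] = "REPEATED"
            f["type"] = inner["type"]
            children = inner.get("fields", [])
        if children:
            f["fields"] = collapse_array_wrappers(children)
        else:
            f.pop("fields", None)
        result.append(f)
    return result


# Abbreviated schema as reported after the Parquet load (wrappers included).
loaded = [
    {"name": "days", "type": "RECORD", "mode": "REQUIRED", "fields": [
        {"name": "array", "type": "RECORD", "mode": "REPEATED", "fields": [
            {"name": "interval_date", "type": "DATE", "mode": "REQUIRED"},
            {"name": "values", "type": "RECORD", "mode": "REQUIRED", "fields": [
                {"name": "array", "type": "RECORD", "mode": "REPEATED", "fields": [
                    {"name": "interval", "type": "INTEGER", "mode": "REQUIRED"},
                    {"name": "value", "type": "FLOAT", "mode": "REQUIRED"},
                ]},
            ]},
        ]},
    ]},
]

flat = collapse_array_wrappers(loaded)
print(flat[0]["mode"], [c["name"] for c in flat[0]["fields"]])
```

The result has days and values as repeated records with no `array` level in between, which is the shape the JSON load gives.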
I used this Avro schema:
{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {"type":"record","name":"days_", "fields": [
      {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
      {"name": "quality", "type": "string"},
      {"name": "values", "type": {"type":"record", "name":"values_","fields": [
        {"name": "interval", "type": "int"},
        {"name": "value", "type": "float"}
      ]}}
    ]}}
  ]
}
I created an empty Avro file with it and ran the command:
bq load --source_format=AVRO <dataset>.<table-name> <avro-file>.avro
Running bq show on the table then gives:
   Last modified                    Schema                    Total Rows   Total Bytes   Expiration   Time Partitioning   Labels   kmsKeyName
 ----------------- ----------------------------------------- ------------ ------------- ------------ ------------------- -------- ------------
  22 May 09:46:02   |- file_name: string (required)           0            0
                    |- file_created: integer (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (required)
                    |  |- interval_date: integer (required)
                    |  |- quality: string (required)
                    |  +- values: record (required)
                    |  |  |- interval: integer (required)
                    |  |  |- value: float (required)