I am trying to add an index column to a dataset by first converting it to a JavaPairRDD, using the code below.
// ds is a Dataset<Row>
JavaPairRDD<Row, Long> indexedRDD = ds.toJavaRDD()
.zipWithIndex();
// Now I am converting JavaPairRDD to JavaRDD as below.
JavaRDD<Row> rowRDD = indexedRDD
.map(tuple -> RowFactory.create(tuple._1(),tuple._2().intValue()));
// I am converting the RDD back to a DataFrame, and this is the step that fails.
Dataset<Row> authDf = session
.createDataFrame(rowRDD, ds.schema().add("ID", DataTypes.IntegerType));
// Below is the ds schema (before adding the ID column).
ds.schema()
root
|-- user: short (nullable = true)
|-- score: long (nullable = true)
|-- programType: string (nullable = true)
|-- source: string (nullable = true)
|-- item: string (nullable = true)
|-- playType: string (nullable = true)
|-- userf: integer (nullable = true)
The code above throws the following error message:
**Job aborted due to stage failure: Task 0 in stage 21.0 failed 4
times, most recent failure: Lost task 0.3 in stage 21.0 (TID 658,
sl73caehdn0406.visa.com, executor 1):
java.lang.RuntimeException:
Error while encoding: java.lang.RuntimeException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is not
a valid external type for schema of smallint**
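The tuple you create in the second statement has two fields: the first is an object (the whole Row, holding all columns of the initial dataset) and the second is the index, which you convert to an integer. RowFactory.create(tuple._1(), tuple._2().intValue()) therefore builds a row with only two values. The first tuple field goes into the first result column, which is of type short, but the value arrives as an object, i.e. a GenericRowWithSchema, and that causes the error. The second tuple field goes into the second result column, which is of type long.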
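You should call RowFactory.create() with one argument per result column: the 7 original column values plus the new ID, so 8 values in total for the schema you pass to createDataFrame.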
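As a minimal sketch of that flattening step, the helper below copies the original column values into a flat array and appends the index as the last value. The class and method names (`RowValues`, `withIndex`) are hypothetical and not part of Spark, and the sample values in `main` are made up to mirror the question's schema. The Spark wiring is shown only as a comment and is an assumption about how you would plug it into your job, not tested code.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Spark): builds the flat value list for one output row.
final class RowValues {
    private RowValues() {}

    /** Copies the original column values and appends the index as the last (ID) value. */
    static Object[] withIndex(Object[] columnValues, long index) {
        Object[] flat = Arrays.copyOf(columnValues, columnValues.length + 1);
        // The question declares ID as IntegerType, so the value has to be an Integer.
        // Math.toIntExact throws instead of silently overflowing on very large datasets.
        flat[columnValues.length] = Math.toIntExact(index);
        return flat;
    }

    // Assumed wiring inside the Spark job (names taken from the question):
    //
    //   JavaRDD<Row> rowRDD = indexedRDD.map(tuple -> {
    //       Row row = tuple._1();
    //       Object[] values = new Object[row.length()];
    //       for (int i = 0; i < values.length; i++) {
    //           values[i] = row.get(i);
    //       }
    //       return RowFactory.create(RowValues.withIndex(values, tuple._2()));
    //   });

    public static void main(String[] args) {
        // Made-up stand-in for one row: user, score, programType, source, item, playType, userf.
        Object[] original = {(short) 7, 120L, "movie", "web", "item-1", "vod", 3};
        System.out.println(Arrays.toString(withIndex(original, 0L)));
    }
}
```

Because RowFactory.create() takes a varargs list of values, passing the flat array gives one value per column, which is what the extended schema expects. If the dataset might hold more rows than fit in an int, declaring the ID column as DataTypes.LongType and keeping the index as a long would avoid the conversion altogether.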