Using a DataFrame with string fields for a DecisionTreeClassifier in Spark

Asked by: 小点点

I have managed to get my decision tree classifier working with the RDD-based API, but now I am trying to switch to the DataFrame-based API in Spark.

I have a dataset like this (but with many more fields):

country, destination, duration, label

Belgium, France, 10, 0
Bosnia, USA, 120, 1
Germany, Spain, 30, 0

First, I load my CSV file into a DataFrame:

val data = session.read
  .format("org.apache.spark.csv")
  .option("header", "true")
  .csv("/home/Datasets/data/dataset.csv")

Then I convert the string columns into numeric columns:

val stringColumns = Array("country", "destination")

val index_transformers = stringColumns.map(
  cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index")
)

Then I use a VectorAssembler to assemble all my features into a single vector, like this:

val assembler = new VectorAssembler()
   .setInputCols(Array("country_index", "destination_index", "duration_index"))
   .setOutputCol("features")

I split the data into training and test sets:

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

Then I create my decision tree classifier:

val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

Then I use a Pipeline to chain all the transformations:

val pipeline = new Pipeline()
  .setStages(Array(index_transformers, assembler, dt))

I train my model and use it to make predictions:

val model = pipeline.fit(trainingData)

val predictions = model.transform(testData)

But I get some errors that I don't understand.

When I run the code as shown, I get the following error:

[error]  found   : Array[org.apache.spark.ml.feature.StringIndexer]
[error]  required: org.apache.spark.ml.PipelineStage
[error]           .setStages(Array(index_transformers, assembler,dt))

So what I did was add a separate pipeline after the index_transformers val and before val assembler:

val index_pipeline = new Pipeline().setStages(index_transformers)
val index_model = index_pipeline.fit(data)
val df_indexed = index_model.transform(data)

I use the new df_indexed DataFrame for my training and test sets, and I removed index_transformers from the pipeline, leaving only assembler and dt:

val Array(trainingData, testData) = df_indexed.randomSplit(Array(0.7, 0.3))

val pipeline = new Pipeline()
  .setStages(Array(assembler,dt))

And I get this error:

Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported.

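It basically says that I am using VectorAssembler on a string column, even though I told it to work on df_indexed, which now has numeric *_index columns. It doesn't seem to use them in the VectorAssembler, and I just don't understand why.

Thank you.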
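A possible explanation, which the thread itself does not confirm: when a CSV file is read with only the header option, Spark reads every column as a string, and VectorAssembler accepts numeric, boolean and vector columns but not strings. Any input column that has not been indexed or cast (duration, for example) would then trigger this error. The sketch below assumes a local session and a small in-memory stand-in for the CSV file; the session settings and the sample rows are illustrative only.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Assumption: a local session for experimentation (written as a script).
val session = SparkSession.builder()
  .master("local[1]")
  .appName("csv-types-sketch")
  .getOrCreate()

// Stand-in for the CSV file: every column is a string, which is what the
// reader produces when no schema is supplied or inferred.
val raw = session.createDataFrame(Seq(
  ("Belgium", "France", "10", "0"),
  ("Bosnia", "USA", "120", "1"),
  ("Germany", "Spain", "30", "0")
)).toDF("country", "destination", "duration", "label")

// Cast the numeric-looking column before it reaches the assembler.
// Reading the file with .option("inferSchema", "true") is another way
// to obtain a numeric column.
val typed = raw.withColumn("duration", col("duration").cast(DoubleType))

val durationAssembler = new VectorAssembler()
  .setInputCols(Array("duration"))
  .setOutputCol("features")

// Succeeds because duration is now a double rather than a string.
val assembled = durationAssembler.transform(typed)
```

Columns that really are categorical, such as country and destination, still need a StringIndexer; the cast only applies to values that are numbers stored as text.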

EDIT

Now I have almost managed to get it working:

val data = session.read
  .format("org.apache.spark.csv")
  .option("header", "true")
  .csv("/home/hvfd8529/Datasets/dataOINIS/dataset.csv")

val stringColumns = Array("country_index", "destination_index", "duration_index")

val stringColumns_index = stringColumns.map(c => s"${c}_index")

val index_transformers = stringColumns.map(
  cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index")
)

val assembler  = new VectorAssembler()
    .setInputCols(stringColumns_index)
    .setOutputCol("features")

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setImpurity("entropy")
  .setMaxBins(1000)
  .setMaxDepth(15)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels())

val stages = index_transformers :+ assembler :+ labelIndexer :+ dt :+ labelConverter

val pipeline = new Pipeline()
  .setStages(stages)


// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "indexedFeatures").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("accuracy = " + accuracy)

val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println("Learned classification tree model:\n" + treeModel.toDebugString)

But now I get an error saying:

value labels is not a member of org.apache.spark.ml.feature.StringIndexer

I don't understand, because I am following the example from the Spark docs :/

2 answers

Anonymous user

It should be:

val pipeline = new Pipeline()
  .setStages(index_transformers ++ Array(assembler, dt): Array[PipelineStage])
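Some context on why this helps: index_transformers is itself an Array[StringIndexer], so Array(index_transformers, assembler, dt) puts that whole array in as a single nested element instead of contributing its stages. That matches the "found: Array[...StringIndexer], required: ...PipelineStage" message. Concatenating with ++ flattens the stages into one array. Below is a minimal sketch of a more explicit spelling of the same idea. The column names come from the question; widening each element with map is just one possible style, not necessarily what the answerer had in mind.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// One indexer per string column, as in the question.
val indexers: Array[StringIndexer] = Array("country", "destination").map { cname =>
  new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index")
}

val featureAssembler = new VectorAssembler()
  .setInputCols(Array("country_index", "destination_index"))
  .setOutputCol("features")

val classifier = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Scala arrays are invariant, so an Array[StringIndexer] cannot be passed
// where an Array[PipelineStage] is expected. Widen each element first,
// then concatenate into a single flat array of stages.
val stages: Array[PipelineStage] =
  indexers.map(indexer => indexer: PipelineStage) ++
    Array[PipelineStage](featureAssembler, classifier)

val flatPipeline = new Pipeline().setStages(stages)
```

Building the stages does not need a running Spark session; only fit and transform do.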

Anonymous user

Here is what I did for my first problem:

val stages = index_transformers :+ assembler :+ labelIndexer :+ rf :+ labelConverter

val pipeline = new Pipeline()
  .setStages(stages)

For the second problem, with the labels, I needed to call .fit(data), like this:

val labelIndexer = new StringIndexer()
  .setInputCol("label_fraude")
  .setOutputCol("indexedLabel")
  .fit(data)
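The reason .fit(data) removes the "value labels is not a member" error is that StringIndexer is the unfitted estimator and does not know the label values yet. Calling fit returns a StringIndexerModel, and that is the type that exposes labels. A minimal sketch follows, assuming a local session and a tiny in-memory dataset shaped like the sample in the question; the column name label is taken from the question rather than from this answer, and newer Spark releases also offer labelsArray as an alternative accessor.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (written as a script).
val session = SparkSession.builder()
  .master("local[1]")
  .appName("label-indexer-sketch")
  .getOrCreate()

// Illustrative rows mirroring the sample data in the question.
val data = session.createDataFrame(Seq(
  ("Belgium", "France", "10", "0"),
  ("Bosnia", "USA", "120", "1"),
  ("Germany", "Spain", "30", "0")
)).toDF("country", "destination", "duration", "label")

// fit() turns the estimator into a model that has seen the label values.
val labelIndexer: StringIndexerModel = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// labels is read without parentheses. By default the most frequent value
// receives index 0, so here "0" (two rows) comes before "1" (one row).
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)
```

The fitted labelIndexer can still be placed in the pipeline's stages, since a fitted model is a valid pipeline stage.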