Asked by: 小点点

KeyError "not in index" during cross-validation


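I have applied an SVM to my dataset. The dataset is multi-label, meaning each observation can have more than one label.

When I run KFold cross-validation, it raises a "not in index" error.

It reports that the indices from 601 to 6007 are not in the index (I have 1…6008 data samples).

Here is my code: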

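import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']
X = df[['sentences']]
y = df[['ADR','WD','EF','INF','SSI','DI','others']]
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# NOTE: stop_words is not defined in this snippet; it is assumed to be set earlier.
SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])

for category in categories:
    print('... Processing {} '.format(category))
    # train the model on the sentences and the labels of the current category
    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
    # score against the true labels (y_test), not the features (X_test)
    print('SVM Linear Test accuracy is {} '.format(accuracy_score(y_test[category], prediction)))
    print('SVM Linear f1 measurement is {} '.format(f1_score(y_test[category], prediction, average='weighted')))
    # pair each test sentence with its predicted 0/1 value for this category
    print([{sentence: int(label)} for sentence, label in zip(X_test['sentences'], prediction)])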

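In fact, I do not know how to apply KFold cross-validation in a way that gives me the F1 score and accuracy for each label separately.

For reproducibility, here is a small sample of the DataFrame; the last seven columns are my labels, including ADR, WD, …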

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1

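Update

When I do what Vivek Kumar suggested, I get this error

ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]

in the classifier part. Do you know how to fix it?

Several Stack Overflow answers say I need to reshape the training data. I tried that too, but it did not work. Thanks :)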


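1 answer

Anonymous user

train_index and test_index are integer positions based on the number of rows. But pandas indexing does not work that way. Newer versions of pandas are stricter about how you slice or select data from a DataFrame.

You need to use .iloc to access the data. More information is available here.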
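To make the difference concrete, here is a minimal sketch. The toy DataFrame and the positions list are made up for illustration, and it assumes the failing code selected rows with plain square brackets (for example X[train_index]), which is one common way to hit this error:

```python
import pandas as pd

# Hypothetical toy frame with a single text column, like X in the question.
df = pd.DataFrame({"sentences": ["a", "b", "c", "d"]})

positions = [0, 2]  # the kind of integer row positions KFold.split yields

# Position-based selection: picks rows 0 and 2.
by_position = df.iloc[positions]

# Plain brackets with a list select COLUMNS by label. There are no columns
# named 0 or 2, so pandas raises a KeyError instead of returning rows.
try:
    df[positions]
    bracket_lookup_failed = False
except KeyError:
    bracket_lookup_failed = True

print(by_position["sentences"].tolist(), bracket_lookup_failed)
```

With .iloc the selection depends only on row order, so it keeps working even if the index labels are not 0…n-1.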

This is what you need:

for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    ...
    ...

    # TfidfVectorizer does not work with a DataFrame,
    # because iterating over a DataFrame yields the column names, not the actual data.
    # So specify the column name explicitly to get the sentences.

    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
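
Putting the pieces together, one possible way to get accuracy and F1 per label is to fit and score inside the fold loop and collect the results per category. This is only a sketch under assumptions, not the asker's final code: the toy data, the two-label subset, n_splits=4, shuffle/random_state, and the scores and summary names are all made up for illustration, and the stop-word list is left out because it is not shown in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data with the same layout as the question, reduced to two labels.
df = pd.DataFrame({
    "sentences": [
        "extreme weight gain and hair loss",
        "I am detoxing from Lexapro now",
        "weight gain and memory loss",
        "I have no idea when this will end",
        "hair loss and weight gain again",
        "I slowly cut my dosage over months",
        "memory loss and hair loss",
        "I am now completely off",
    ],
    "ADR":    [1, 0, 1, 0, 1, 0, 1, 0],
    "others": [0, 1, 0, 1, 0, 1, 0, 1],
})
categories = ["ADR", "others"]

X = df["sentences"]   # a Series of strings, so the vectorizer sees the text itself
y = df[categories]

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = {c: {"accuracy": [], "f1": []} for c in categories}

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    for category in categories:
        # Build a fresh pipeline for every label and fold.
        pipeline = Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("clf", OneVsRestClassifier(LinearSVC(), n_jobs=1)),
        ])
        pipeline.fit(X_train, y_train[category])
        prediction = pipeline.predict(X_test)
        scores[category]["accuracy"].append(
            accuracy_score(y_test[category], prediction))
        scores[category]["f1"].append(
            f1_score(y_test[category], prediction, average="weighted"))

# Average each metric over the folds, separately for each label.
summary = {c: {m: sum(v) / len(v) for m, v in s.items()}
           for c, s in scores.items()}
print(summary)
```

Passing the Series df["sentences"] rather than the one-column DataFrame also avoids the "inconsistent numbers of samples: [1, 5408]" error from the update, since a DataFrame iterates over its single column name and the vectorizer then sees one "document".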