Asked by: 小点点

Trying out the scikit-learn example with my own classifier and data


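I have built a small program that uses scikit-learn to create a classifier for a given dataset. Now I want to try it out with this example to see the classifier at work. For example, the clf has to detect "cats".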

This is how I go about it:

I have 50 pictures of cats and 50 pictures of "no cats".

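  • Use a SIFT feature detector to get the descriptors of the data_set
  • Split the data into a training set and a test set (25 cat pictures + 25 non-cat pictures = training_set, same for test_set)
  • Use kmeans on the training_set to get the cluster centers
  • Use the cluster centers to create histogram data for the training_set and the test_set
  • Try the following code: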

    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
    
    scores = ['precision', 'recall']
    
    for score in scores:
      print("# Tuning hyper-parameters for %s" % score)
      print()
    
      clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
      clf.fit(X_train, y_train)
    
      print("Best parameters set found on development set:")
      print()
      print(clf.best_estimator_)
      print()
      print("Grid scores on development set:")
      print()
      for params, mean_score, scores in clf.grid_scores_:
         print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() / 2, params))
      print()
      print("Detailed classification report:")
      print()
      print("The model is trained on the full development set.")
      print("The scores are computed on the full evaluation set.")
      print()
      y_true, y_pred = y_test, clf.predict(X_test)
      print y_true
      print y_pred
      print(classification_report(y_true, y_pred))
      print()
      print clf.score(X_train, y_train)
      print "score"
      print clf.best_params_
      print "best_params"
      pred = clf.predict(X_test)
      print accuracy_score(y_test, pred)
      print "accuracy_score"
    

    The result I get is:

    # Tuning hyper-parameters for recall
    ()
    /usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 0.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 0.]. 
      average=average)
    /usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 1.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 1.]. 
      average=average)
    Best parameters set found on development set:
    ()
    SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0, degree=3,
      gamma=0.001, kernel=rbf, max_iter=-1, probability=False,
      random_state=None, shrinking=True, tol=0.001, verbose=False)
    ()
    Grid scores on development set:
    ()
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.0001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.0001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.0001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.0001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.0001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.0001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.001}
    0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.0001}
    ()
    Detailed classification report:
    ()
    The model is trained on the full development set.
    The scores are computed on the full evaluation set.
    ()
    [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
      1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
      0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
    [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
      1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
      1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  1.  1.  1.]
                 precision    recall  f1-score   support
    
            0.0       1.00      0.04      0.08        25
            1.0       0.51      1.00      0.68        25
    
    avg / total       0.76      0.52      0.38        50
    
    ()
    0.52
    score
    {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
    best_params
    0.52
    accuracy_score
    

    It seems the clf says "it's a cat" for everything... but why?

    Is the dataset too small to get good results?

    Edit: I am using VLFeat to detect the SIFT descriptors.

    Functions:

    def create_descriptor_data(data, ID):
        descriptor_list = []
        datas = numpy.genfromtxt(data,dtype='str')
        for p in datas:
          locs, desc = vlfeat_module.vlf_create_descriptors(p,str(ID)+'.key',ID) # create descriptors and save descs in file
          if len(desc) > 500:
            desc = desc[::len(desc) // 400] # integer step; keeps between 400 and 800 descriptors
          descriptor_list.append(desc)
          ID += 1 # ID for filename
        return descriptor_list
    
    # create k-mean centers from all *.txt files in directory (data)
    def create_center_data(data):
        #data = numpy.vstack(data)
        n_clusters = len(numpy.unique(data))
        kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1)
        kmeans.fit(data)
        return kmeans, n_clusters
    
    def create_histogram_data(kmeans, descs, n_clusters):
        histogram_list = []
        # load from each file data
        for desc in descs:
          length = len(desc)
          # create histogram from descriptors
          histogram = kmeans.predict(desc)
          histogram = numpy.bincount(histogram, minlength=n_clusters) #minlength = k in k-means 
          histogram = numpy.divide(histogram, length, dtype='float')
          histogram_list.append(histogram)
        histogram = numpy.vstack(histogram_list)
        return histogram
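
For reference, the histogram step can be sketched in plain NumPy. This is a minimal illustration of what `create_histogram_data` computes for a single image, assuming that `kmeans.predict` assigns each descriptor to its nearest cluster center in Euclidean distance. The helper name `bow_histogram` and the toy arrays are hypothetical and not part of the original code.

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Return the normalized visual-word histogram for one image."""
    # Squared Euclidean distance from every descriptor to every center.
    diffs = descriptors[:, None, :] - centers[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Index of the nearest center = the "visual word" of each descriptor.
    words = dists.argmin(axis=1)
    counts = np.bincount(words, minlength=len(centers))
    # Dividing by the descriptor count makes images with different
    # numbers of keypoints comparable.
    return counts / float(len(descriptors))

# Toy stand-ins (hypothetical): 3 centers and 4 two-dimensional descriptors.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
descriptors = np.array([[1.0, 0.0], [9.0, 11.0], [0.0, 1.0], [19.0, 1.0]])
hist = bow_histogram(descriptors, centers)
print(hist)
```

Each histogram sums to 1, so the feature vector describes the mix of visual words in an image rather than how many keypoints were found.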
    

    Call:

    X_desc_pos = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_pos.txt",0) # create desc from dataset_pos, 25 pics
    X_desc_neg = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_neg.txt",51) # create desc from dataset_neg, 25 pics
    
    X_train_pos, X_test_pos = train_test_split(X_desc_pos, test_size=0.5)
    X_train_neg, X_test_neg = train_test_split(X_desc_neg, test_size=0.5)
    
    x1 = numpy.vstack(X_train_pos)
    x2 = numpy.vstack(X_train_neg)
    kmeans, n_clusters = lib.dataset_module.create_center_data(numpy.vstack((x1,x2)))
    
    X_train_pos = lib.dataset_module.create_histogram_data(kmeans, X_train_pos, n_clusters)
    X_train_neg = lib.dataset_module.create_histogram_data(kmeans, X_train_neg, n_clusters)
    
    X_train = numpy.vstack([X_train_pos, X_train_neg])
    y_train = numpy.hstack([numpy.ones(len(X_train_pos)), numpy.zeros(len(X_train_neg))])
    
    X_test_pos = lib.dataset_module.create_histogram_data(kmeans, X_test_pos, n_clusters)
    X_test_neg = lib.dataset_module.create_histogram_data(kmeans, X_test_neg, n_clusters)
    
    X_test = numpy.vstack([X_test_pos, X_test_neg])
    y_test = numpy.hstack([numpy.ones(len(X_test_pos)), numpy.zeros(len(X_test_neg))])
    
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                         'C': [1, 10, 100, 1000]},
                        {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
    
    scores = ['precision', 'recall']
    
    for score in scores:
        print("# Tuning hyper-parameters for %s" % score)
        print()
    
        clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
        clf.fit(X_train, y_train)
    
        print("Best parameters set found on development set:")
        print()
        print(clf.best_estimator_)
        print()
        print("Grid scores on development set:")
        print()
        for params, mean_score, scores in clf.grid_scores_:
           print("%0.3f (+/-%0.03f) for %r"
                  % (mean_score, scores.std() / 2, params))
        print()
        print("Detailed classification report:")
        print()
        print("The model is trained on the full development set.")
        print("The scores are computed on the full evaluation set.")
        print()
        y_true, y_pred = y_test, clf.predict(X_test)
        print y_true
        print y_pred
        print(classification_report(y_true, y_pred))
        print()
        print clf.score(X_train, y_train)
        print "score"
        print clf.best_params_
        print "best_params"
        pred = clf.predict(X_test)
        print accuracy_score(y_test, pred)
        print "accuracy_score"
    

    Edit: I made some changes by updating the parameter ranges and scoring on "accuracy" again:

    # Tuning hyper-parameters for accuracy
    ()
    Best parameters set found on development set:
    ()
    SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
      gamma=1.0, kernel=rbf, max_iter=-1, probability=False, random_state=None,
      shrinking=True, tol=0.001, verbose=False)
    ()
    Grid scores on development set:
    ()
    ...
    ()
    Detailed classification report:
    ()
    The model is trained on the full development set.
    The scores are computed on the full evaluation set.
    ()
    [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
      1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
      0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
    [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  0.  1.  1.  1.
      1.  1.  1.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
      0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
                 precision    recall  f1-score   support
    
            0.0       0.88      0.92      0.90        25
            1.0       0.92      0.88      0.90        25
    
    avg / total       0.90      0.90      0.90        50
    
    ()
    1.0
    score
    {'kernel': 'rbf', 'C': 1000.0, 'gamma': 1.0}
    best_params
    0.9
    accuracy_score
    

    But when I try it on a single picture with

    rslt = clf.predict(test_histogram)
    

    it still tells the sofa "you are a cat" :D


  • 2 answers

    Anonymous user

    It seems the clf says "it's a cat" for everything... but why?

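    It's hard to tell from the pasted output, but this appears to be the second iteration of the loop over scores = ['precision', 'recall'], so you are optimizing for recall. That is consistent with the classification report, which shows a recall of 1.00 (perfect) for the positive class.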

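    When is recall perfect? Well, when there are no false negatives, i.e. no cat goes undetected. So an easy way to get perfect recall is to predict "cat" for every input picture, regardless of whether it is a cat, and GridSearchCV has found a classifier that does exactly that.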

    Something similar can happen when you optimize for precision: never predicting "cat" produces no false positives, so it can achieve perfect precision.
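The numbers in the question's classification report can be reproduced by hand. The sketch below is plain Python, assuming the test labels and predictions shown in the question (25 cats, 25 non-cats, and a single "not a cat" prediction). The helper `precision_recall` is hypothetical and only written for this illustration. It also shows that a classifier which never predicts "cat" leaves precision as 0/0, which is the situation the UserWarning in the question's output reports.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision is undefined (0/0) when the label is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# Test labels as in the question: 25 cats (1) followed by 25 non-cats (0).
y_true = [1] * 25 + [0] * 25

# Predictions as in the question: "cat" for everything except one non-cat.
question_pred = [1] * 25 + [1] * 24 + [0]
precision, recall = precision_recall(y_true, question_pred)
accuracy = sum(t == p for t, p in zip(y_true, question_pred)) / len(y_true)
print(round(precision, 2), recall, accuracy)

# A classifier that never predicts "cat": recall drops to zero and
# precision has no defined value.
never_cat = [0] * 50
never_precision, never_recall = precision_recall(y_true, never_cat)
print(never_precision, never_recall)
```

Recall alone says nothing about the 24 sofas that were labelled as cats, which is why it is a poor sole objective for the grid search.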

    To avoid this, optimize for accuracy instead of precision or recall, or for F_β if you have a class-imbalance situation.
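A minimal sketch of both options, assuming a recent scikit-learn in which `GridSearchCV` lives in `sklearn.model_selection`. The data comes from `make_classification` as a synthetic stand-in for the histogram features, and `beta=1.0` is an arbitrary illustrative choice, so the selected parameters say nothing about the cat dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the histogram features (an assumption, not the
# question's data): 100 samples, two roughly balanced classes.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

param_grid = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

# Option 1: select hyper-parameters by accuracy.
search_acc = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search_acc.fit(X, y)

# Option 2: select by F-beta; beta=1 weighs precision and recall equally.
f_beta = make_scorer(fbeta_score, beta=1.0)
search_f = GridSearchCV(SVC(), param_grid, cv=5, scoring=f_beta)
search_f.fit(X, y)

print(search_acc.best_params_, search_acc.best_score_)
print(search_f.best_params_, search_f.best_score_)
```

Both scores penalize a classifier that answers "cat" for everything, unlike recall on its own.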

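    Anonymous user

    There are many possible reasons for this behavior:

    • An error in creating the training/testing data [implementation error]
    • A training set of 20 elements (25 vectors, 5-fold cross-validation leaves 20 for training) may be too small to generalize well [underfitting]
    • The range of C and gamma values checked may be too narrow. These parameters are highly data dependent, and your representation may need values of C and gamma completely different from the ones currently used [underfitting/overfitting]

    My personal guess (since the problem is hard to reproduce without the data) is the third option: bad C and gamma values for finding a good model.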

    EDIT

    You should try a much wider range of values, e.g.

    C=[]
    gamma=[]
    for i in range(21): C.append(10.0**(i-5))
    for i in range(17): gamma.append(10**(i-14))
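
The same ranges can be plugged into the two-part grid from the question. The sketch below is plain Python and only builds the grid; the candidate count is worth checking first, since every candidate is fitted once per cross-validation fold. The names `C_range`, `gamma_range` and `n_candidates` are chosen here for illustration.

```python
# Exponents match the loops above: C spans 1e-5 .. 1e15, gamma 1e-14 .. 1e2.
C_range = [10.0 ** e for e in range(-5, 16)]
gamma_range = [10.0 ** e for e in range(-14, 3)]

# Same two-part grid shape as in the question, with the wider ranges.
tuned_parameters = [
    {"kernel": ["rbf"], "gamma": gamma_range, "C": C_range},
    {"kernel": ["linear"], "C": C_range},
]

# Every (C, gamma) pair for rbf plus every C for linear.
n_candidates = len(C_range) * len(gamma_range) + len(C_range)
print(len(C_range), len(gamma_range), n_candidates)
```

With cv=5 this grid means 378 candidates and 1890 fits, which is affordable for 50 training histograms but grows quickly on larger datasets.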
    

    EDIT 2

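    Once the parameter ranges have been corrected, you should carry out the actual "case study". Collect more images, analyze your data representation (are histograms really sufficient for this task?), process your data (is it normalized? maybe try some decorrelation?), and consider using a simpler kernel. RBF can be very deceptive: on the one hand it can achieve a high score in training, but on the other it can fail completely in testing. This is a result of its capacity to overfit (for any consistent dataset, an RBF SVM can achieve a 100% score during training), so finding the balance between the model's power and its generalization ability is a hard problem. This is where the real "machine learning journey" begins, have fun!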
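The overfitting capacity of the RBF kernel can be illustrated on pure noise. The sketch below assumes scikit-learn and NumPy; the random data, the seed and the values `gamma=10.0` and `C=1000.0` are arbitrary choices for the demonstration and are unrelated to the cat dataset.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Pure noise: the labels carry no information about the features.
X_train = rng.randn(100, 10)
y_train = rng.randint(0, 2, size=100)
X_test = rng.randn(200, 10)
y_test = rng.randint(0, 2, size=200)

# A very narrow RBF kernel (large gamma) with weak regularization (large C)
# can memorize every training point.
clf = SVC(kernel="rbf", gamma=10.0, C=1000.0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(train_acc, test_acc)
```

The training score is perfect even though there is nothing to learn, while the test score stays near chance. A score computed on the training set, such as `clf.score(X_train, y_train)` in the question, therefore says little about how the model behaves on new pictures.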