Asked by: 小点点

Saving and using a TF-IDF vectorizer for future examples leads to a dimension mismatch error


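So, I am training a Multinomial Naive Bayes classifier from scikit-learn. I can now save that classifier using joblib (from sklearn.externals import joblib).

I now want to write a script to classify new examples. My only problem is taking new data as a string and passing it to classifier.predict(...), which requires the data in vectorized form.

Previously, I created a vectorizer as follows: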

vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2),  stop_words='english', strip_accents='unicode', norm='l2',decode_error="ignore")

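Now, the way TF-IDF vectorization works is that it needs many, many documents. So I cannot just create a new vectorizer, pass it a single piece of data, and then classify the result. I obviously need to save the vectorizer.

This really comes down to how to convert new data into the same form I trained the classifier on.

Am I correct in using transform, i.e. vectorizer.transform(X_test_title)?

Edit:

It seems I was right in my last comment above. However, now that I load the classifier and vectorizer into my script, I seem to have a problem passing the vectorized data to the classifier. Below is my function, which takes a title and a document, both clean strings: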

def predict_function(title_data, document_data):
    # Repeat the title number_repeat_title(...) times, then append the document text.
    data = ((title_data + ' ') * number_repeat_title(title_data, document_data)) + document_data
    # transform() requires a list of documents, not a bare string
    data = [data, 'testing another element works']
    print(data)
    data_vector = vectorizer.transform(data)
    print(data_vector)  # checking data is good!
    predicted = classifier.predict(data_vector)
    return predicted

An example call to this function:

predict_function('mr sponge bob square pants', "SpongeBob SquarePants is an American animated television series created by marine biologist and animator Stephen Hillenburg for Nickelodeon. The series chronicles the adventures and endeavors of the title character and his various friends in the fictional underwater city of Bikini Bottom. The series' popularity has made it a media franchise, as well as Nickelodeon network's highest rated show, and the most distributed property of MTV Networks. The media franchise has generated $8 billion in merchandising revenue for Nickelodeon.")

I get an error at the prediction step:

predicted = classifier.predict(data_vector)

which gives:

/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/naive_bayes.pyc in predict(self, X)
     61             Predicted target values for X
     62         """
---> 63         jll = self._joint_log_likelihood(X)
     64         return self.classes_[np.argmax(jll, axis=1)]
     65 

/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
    455         """Calculate the posterior log probability of the samples X"""
    456         X = atleast2d_or_csr(X)
--> 457         return (safe_sparse_dot(X, self.feature_log_prob_.T)
    458                 + self.class_log_prior_)
    459 

/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
    189     from scipy import sparse
    190     if sparse.issparse(a) or sparse.issparse(b):
--> 191         ret = a * b
    192         if dense_output and hasattr(ret, "toarray"):
    193             ret = ret.toarray()

/Library/Python/2.7/site-packages/scipy-0.14.0.dev_572aaf0-py2.7-macosx-10.9-intel.egg/scipy/sparse/base.pyc in __mul__(self, other)
    337 
    338             if other.shape[0] != self.shape[1]:
--> 339                 raise ValueError('dimension mismatch')
    340 
    341             result = self._mul_multivector(np.asarray(other))

ValueError: dimension mismatch

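1 answer

Anonymous user

Looking at the scikit-learn documentation found here (http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html), I believe you are correct.

The training data in the scikit-learn example is vectorized as follows: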

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)

This means the vectorizer now remembers the vocabulary and the TF-IDF weights it learned from the training data.
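To make "remembers" concrete, here is a minimal sketch on a made-up toy corpus (the documents are an assumption for illustration, not data from the question). After fitting, the learned state is stored on the vectorizer itself as the vocabulary_ and idf_ attributes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Fitting stores a term -> column index mapping and one IDF weight per term.
n_features = len(vectorizer.vocabulary_)
print(n_features, vectorizer.idf_.shape, X_train.shape)
```

The number of columns in X_train equals the size of the learned vocabulary, and that width is what a classifier trained on X_train will expect later.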

Those weights are then applied to the test data with the following line of code:

X_test = vectorizer.transform(data_test.data)
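Applied to the script in the question, that suggests saving the fitted vectorizer alongside the classifier and only ever calling transform on new text. The following is a rough sketch under stated assumptions, not the asker's actual setup: the toy documents, labels, and file names are made up, and it imports the standalone joblib package (older scikit-learn versions exposed the same module as sklearn.externals.joblib).

```python
import os
import tempfile

import joblib  # standalone package; formerly also available as sklearn.externals.joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data and labels, made up for illustration only.
train_docs = [
    "cartoon sponge lives under the sea",
    "animated series about a sea sponge",
    "stock markets fell sharply today",
    "investors worry as markets fall",
]
train_labels = ["tv", "tv", "finance", "finance"]

# Training side: fit the vectorizer and the classifier, then save BOTH.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
classifier = MultinomialNB().fit(X_train, train_labels)

model_dir = tempfile.mkdtemp()  # hypothetical location for the saved files
vec_path = os.path.join(model_dir, "vectorizer.joblib")
clf_path = os.path.join(model_dir, "classifier.joblib")
joblib.dump(vectorizer, vec_path)
joblib.dump(classifier, clf_path)

# Prediction side: load both objects and call transform (not fit) on new text.
loaded_vectorizer = joblib.load(vec_path)
loaded_classifier = joblib.load(clf_path)

new_docs = ["an animated sponge under the sea"]  # a single string still goes in a list
X_new = loaded_vectorizer.transform(new_docs)
predicted = loaded_classifier.predict(X_new)
print(X_new.shape, predicted)

# For contrast: a freshly fitted vectorizer learns a different vocabulary,
# so its output width no longer matches what the classifier was trained on.
X_wrong = TfidfVectorizer(stop_words="english").fit_transform(new_docs)
print(X_wrong.shape[1], "columns vs", X_train.shape[1], "expected")
```

One possible cause of a "dimension mismatch" like the one in the traceback is that the vectorizer loaded in the script is not the same fitted object the classifier was trained with, or was refitted on other data. Comparing data_vector.shape[1] with classifier.feature_log_prob_.shape[1] is a quick way to check.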