Asked by: 小点点

Multi-label classification with scikit-learn using out-of-core learning


I am new to scikit-learn. For work I am on a project that involves multi-label classification of 70,000 web pages in a roughly 250MB file. Because of the file size, I have to use out-of-core classification. The labels for these pages are dmoz categories, so each page can have multiple labels.

I created the code below by adapting scikit-learn's out-of-core example. However, it prints only one label for each document.

1) Is there some way to print the top 5 labels for each document, ranked by probability? I would appreciate any pointers or modifications to the code.

2) Given that OneVsRest does not provide a partial_fit method, what would be a good classifier that supports multi-label classification for this task?

The combined text in file_training_combined.csv looks like this:

"http://home.earthlink.net/~rvbears/","RV Resources - Camping Information - RV Accessories","","","","","RV Resources - Camping Information - RV Accessories RV Resources\, Camping Resources\, Camping Information  RV\, Camping Resources and Information! For Campers\, Travel Trailers\, Motorhome and Fifth Wheels Owners  Camping Games  Camping Recipes  Camping Cooking Supplies  RV Books  RV E-Books  RV Videos/DVD  RV Links   Looking for rv and camping information\, this is it! Check in here for lots of great resources and information especially for newbies. From Camping Gear\, to RV Books\, E-Books\, and Videos our pages are filled with information about everything to do with Camping and RVing to get you headed in the right direction\, from companies you can trust. Refer to the RV Links section for lots of camping gear and rv accessories\, find just about anything that you are looking for. Coming Back Soon....Our ""PRODUCT REVIEWS BLOG"" Will we be returning to reviewing our best bets on some of the newest camping gadgets for inside and outside your rv or tent.      Emergency medical & travel assistance for less than 22 cents a day. Good Sam TravelAssist. Learn More! With over 2 million rescues and recoveries and counting\, Good Sam Roadside Assistance gives our members peace of mind when they travel.  RV Accessories\, RV Decor\, RV Books\, RV E-books\, RV Videos\, RV DVDs RV Resources\, Camping Resources\, Camping Information NOTE: RV Ladders Bears are now SOLD OUT Home | Woodworking Links | Link To Us Copyright  2002-2014 GoCampin'. All Rights Reserved. Go Campin' ~ PO BOX 25417 ~ Greenville\, SC 29616-0417","/Top/Shopping/Crafts/Woodcraft/Decorative|/Top/Shopping/Crafts/Woodcraft/HomeDecor"

This is just one row of the CSV file. I am using the text in column 6; the labels are in column 7, separated by |.

import codecs
import itertools
import time
import csv
import sys
import re

from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

__author__ = 'prateek.jain'

csv.field_size_limit(sys.maxsize)

sep = ","  # the csv module in Python 3 expects str, not bytes
quote_char = '"'

stop = stopwords.words('english')
porter = PorterStemmer()

text_rows = []

text_labels = []

training_file_object = codecs.open('file_training_combined.csv','r', 'utf-8')
wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)

output_file = 'output.csv'
output_file_object = open(output_file, 'w')

for row in wr1:
    text_rows.append(row[6])
    labels = row[7].strip().split('|')
    empty_list = []
    for label in labels:
        if not ('http:' in label.lower() or 'www:' in label.lower()):
            empty_list.append(label)
    text_labels.append(empty_list)


def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return tokenized  # the original returned `text`, silently discarding the stemming


# dialect='excel'
def stream_docs(path):
    training_file_object = codecs.open(path, 'r', 'utf-8')
    wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)
    print(next(wr1))  # print (and skip) the first row; csv readers have no .next() in Python 3
    for row in wr1:
        text, label = row[6], row[7]
        labels = label.split('|')
        empty_list = []
        for label in labels:
            if not ('http:' in label.lower() or 'www:' in label.lower()):
                empty_list.append(label)
        yield text, empty_list


def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y


from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2 ** 10,
                         preprocessor=None,
                         lowercase=True,
                         tokenizer=tokenizer,
                         alternate_sign=False)  # keeps features non-negative for MultinomialNB;
                                                # older scikit-learn spells this non_negative=True


# Note: MultinomialNB offers partial_fit, but it is a single-label classifier,
# so predict() returns one class per document (the behaviour described in question 1).
clf = MultinomialNB()
doc_stream = stream_docs(path='file_training_combined.csv')





# Collect the full label universe up front so that the binarizer's
# columns stay consistent across minibatches.
merged = list(itertools.chain(*text_labels))
my_set = set(merged)

class_label_list = list(my_set)
all_class_labels = np.array(class_label_list)
mlb = MultiLabelBinarizer(classes=all_class_labels)

# Hold out the first 1,000 documents from the stream as a test set.
X_test_text, y_test = get_minibatch(doc_stream, 1000)

X_test = vect.transform(X_test_text)

tick = time.time()
accuracy = 0
total_fit_time = 0
n_train_pos = 0
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train_matrix = vect.transform(X_train)  # HashingVectorizer is stateless; transform() is enough
    y_train = mlb.fit_transform(y_train)  # classes are fixed, so columns stay aligned across batches
    print(X_train_matrix.shape, y_train.shape)
    clf.partial_fit(X_train_matrix.toarray(), y_train, classes=all_class_labels)
    total_fit_time += time.time() - tick
    n_train = X_train_matrix.shape[0]
    n_train_pos += y_train.sum()  # total positive labels in this batch
    tick = time.time()

predicted = clf.predict(X_test)
all_labels = predicted


# Pair each test document with its predicted labels; the original zipped
# X_train (the last training batch) against predictions made on X_test.
for item, labels in zip(X_test_text, all_labels):
    print('%s => %s' % (item, labels))
    output_file_object.write('%s => %s\n' % (item, labels))

1 Answer

Anonymous user

Since it is only 250MB, there is no reason to go out-of-core. Or do you have less than 250MB of memory? To get the top-k predictions, you can use predict_proba or decision_function to obtain a likelihood for each label.
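As a minimal sketch of both ideas (the documents, labels, and the helper name partial_fit_batch below are made up for illustration, not taken from the question's files): train one independent binary SGDClassifier per label, which keeps the streaming partial_fit workflow from the question, then stack the per-label predict_proba columns and take the five highest per document.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label universe, collected up front as in the question's code.
all_labels = ['/Top/Home', '/Top/Recreation', '/Top/Shopping/Crafts']
mlb = MultiLabelBinarizer(classes=all_labels)
mlb.fit([[]])  # classes are fixed above, so fit() only records them

vect = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)

# One independent binary classifier per label; SGDClassifier supports
# partial_fit, so the whole ensemble trains one minibatch at a time.
# (loss='log_loss' is spelled loss='log' in scikit-learn < 1.1.)
models = [SGDClassifier(loss='log_loss') for _ in all_labels]

def partial_fit_batch(texts, label_sets):
    X = vect.transform(texts)
    Y = mlb.transform(label_sets)
    for j, model in enumerate(models):
        model.partial_fit(X, Y[:, j], classes=[0, 1])

# Toy minibatch; in practice this would come from get_minibatch(doc_stream, ...).
partial_fit_batch(
    ['rv camping resources and gear', 'decorative woodcraft for the home'],
    [['/Top/Home', '/Top/Recreation'], ['/Top/Shopping/Crafts']])

# Rank labels per document: column j of proba is P(label j | document).
X_new = vect.transform(['camping equipment'])
proba = np.column_stack([m.predict_proba(X_new)[:, 1] for m in models])
for row in proba:
    top5 = np.argsort(row)[::-1][:5]
    print([(all_labels[j], round(row[j], 3)) for j in top5])

Newer scikit-learn releases also give OneVsRestClassifier a partial_fit of its own (usable when the base estimator has one), which packages this same one-model-per-label loop; and if the ~250MB corpus really does fit in memory, fitting OneVsRestClassifier over a probabilistic base estimator once and calling predict_proba is simpler still.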