Sesame HTTP: Pitfalls of Bayesian text classification with scikit-learn

Keywords: encoding, JSON, pip, PyCharm

Basic steps:

1. Training material classification:

I followed the official directory structure, with one subdirectory per category:

Put the corresponding texts in each category directory, one article per txt file.
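
As a rough illustration of that layout, the sketch below builds a tiny throwaway corpus in a temporary directory and reads it back with load_files. The category names (positive, negative), the sentences and the file names are made up for the example; only the one-subdirectory-per-category shape matters.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical categories and articles, invented for this sketch.
corpus = {
    "negative": ["the service was slow", "the food was cold"],
    "positive": ["the staff were friendly"],
}

with tempfile.TemporaryDirectory() as root:
    for category, articles in corpus.items():
        category_dir = Path(root) / category
        category_dir.mkdir()
        for i, text in enumerate(articles):
            # One article per txt file, inside its category directory.
            (category_dir / f"{i}.txt").write_text(text, encoding="utf-8")

    # File contents are read into memory here, before the directory is removed.
    data = load_files(root, encoding="utf-8")

# Directory names become the category names; targets index into them.
print(data.target_names)
print(len(data.data), len(data.target))
```

The integer labels in data.target are positions in data.target_names, which is why the category names have to be saved alongside the model if predictions are to be printed as names later.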


Note that every category should contain roughly the same amount of material (adjust this according to the training results). If the imbalance is too large, the model tends to overfit toward the dominant category; put simply, most articles will be assigned to the category with the most material.
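
A quick way to check the balance before training is to count the txt files in each category directory. The helper names below (count_articles, imbalance_ratio) and the example counts are hypothetical and not part of the original scripts; this is only a minimal sketch using the standard library.

```python
from collections import Counter
from pathlib import Path


def count_articles(data_dir):
    """Return a Counter mapping each category directory to its number of txt files."""
    counts = Counter()
    for category_dir in Path(data_dir).iterdir():
        if category_dir.is_dir():
            counts[category_dir.name] = sum(1 for _ in category_dir.glob("*.txt"))
    return counts


def imbalance_ratio(counts):
    """Largest category size divided by the smallest (1.0 means perfectly balanced)."""
    sizes = [n for n in counts.values() if n > 0]
    return max(sizes) / min(sizes) if sizes else 1.0


# Made-up counts for illustration: one category has three times the material.
example_counts = Counter({"positive": 120, "negative": 40})
print(imbalance_ratio(example_counts))
```

With a real corpus, imbalance_ratio(count_articles('./data')) gives a single number to watch while adding or trimming material.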

Enough talk; let's go straight to the code.

You need one small tool: pip install chinese-tokenizer

Here's the trainer:

import json

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from chinese_tokenizer.tokenizer import Tokenizer
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

jie_ba_tokenizer = Tokenizer().jie_ba_tokenizer

# Load the dataset: one subdirectory per category under ./data
training_data = load_files('./data', encoding='utf-8')
# x_train is the txt content, y_train is the category (positive or negative)
x_train, _, y_train, _ = train_test_split(training_data.data, training_data.target)
print('Start modeling.....')
# Save the category names so the prediction script can map labels back to names.
# (The file name is elided in the original; the write call is reconstructed
# from the json.loads call in the prediction script.)
with open('', 'w', encoding='utf-8') as f:
    f.write(json.dumps(training_data.target_names))
# The tokenizer parameter is a function used to segment text
count_vect = CountVectorizer(tokenizer=jie_ba_tokenizer)

tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(x_train)

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print('Training classifier.....')
# Train a multinomial naive Bayes classifier
clf = MultinomialNB().fit(X_train_tfidf, y_train)
# Save the classifier (so it can be used in other programs)
joblib.dump(clf, 'model.pkl')
# Save the vectorizer (pitfall here!! The predictor must use the same vectorizer as the trainer, otherwise it fails with ValueError: dimension mismatch)
joblib.dump(count_vect, 'count_vect')
print("Information about the classifier:")
print(clf)
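
One way to sidestep the mismatch is to bundle the vectorizer, the TF-IDF transformer and the classifier into a single scikit-learn Pipeline and save that one object. This is a sketch of an alternative, not the approach used in the trainer above. The toy English sentences, the labels and the file name pipeline.pkl are assumptions made up for the example, and the default tokenizer stands in for the jieba one.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data invented for this sketch (0 = negative, 1 = positive).
texts = ["good great fine", "great good nice", "bad awful poor", "poor bad terrible"]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)

# Saving the whole pipeline keeps the fitted vocabulary and IDF weights
# together with the classifier. The path is a hypothetical temp location.
model_path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
joblib.dump(pipeline, model_path)

# In another program: load once, then predict on raw text directly.
restored = joblib.load(model_path)
predicted = restored.predict(["good nice", "awful terrible"])
print(predicted)
```

Because the restored pipeline carries its own fitted vectorizer, there is no separate count_vect file to keep in sync with the model.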

The following script classifies articles using the trained classifier:

Articles to be classified go in the predict_data directory, again one article per txt file:

# -*- coding: utf-8 -*-
# @Time    : 2017/8/23 18:02
# @Author: ouch
# @Site    : 
# @File: Bayesian
# @Software: PyCharm
import json

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfTransformer

# Load the classifier and the vectorizer saved by the trainer
clf = joblib.load('model.pkl')
count_vect = joblib.load('count_vect')
testing_data = load_files('./predict_data', encoding='utf-8')
# Category names written by the trainer (the file name is elided in the original)
with open('', 'r', encoding='utf-8') as f:
    target_names = json.loads(f.read())
tfidf_transformer = TfidfTransformer()
# Use transform, not fit_transform, so the training vocabulary is reused
X_new_counts = count_vect.transform(testing_data.data)
# Note: as in the original, the IDF weights are refitted on the new documents
X_new_tfidf = tfidf_transformer.fit_transform(X_new_counts)
# Predict
predicted = clf.predict(X_new_tfidf)
for title, category in zip(testing_data.filenames, predicted):
    print('%r => %s' % (title, target_names[category]))

This way, the trained classifier no longer fails with ValueError: dimension mismatch when it is used in a new program.
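
To see where the error comes from, the sketch below fits a second CountVectorizer on the new documents and compares its feature count with the one the classifier was trained on. The sentences and labels are invented for illustration, and the default tokenizer is used instead of jieba. The point is only that each fit learns its own vocabulary, so the column counts differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data invented for this sketch (0 = negative, 1 = positive).
train_texts = ["good great fine", "bad awful poor"]
train_labels = [1, 0]
new_texts = ["good film"]

train_vect = CountVectorizer()
clf = MultinomialNB().fit(train_vect.fit_transform(train_texts), train_labels)

# Wrong: a freshly fitted vectorizer learns a different vocabulary.
fresh_vect = CountVectorizer()
wrong_features = fresh_vect.fit_transform(new_texts)
print(wrong_features.shape[1], "columns, but trained on", len(train_vect.vocabulary_))

try:
    clf.predict(wrong_features)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)

# Right: reuse the vectorizer fitted during training (transform, not fit_transform).
right_features = train_vect.transform(new_texts)
print(clf.predict(right_features))
```

The exact wording of the error message depends on the scikit-learn version, but in both old and new releases it is a ValueError about the number of features.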

Posted by KevinCB on Sat, 02 May 2020 13:36:22 -0700