Basic steps:
1. Organize the training material by category:
I follow the directory layout that scikit-learn's load_files expects: one subdirectory per category, and inside each directory one txt file per article, as sketched below.
Note that the categories should contain roughly the same amount of material (adjust this according to the training results). If one category is much larger than the others, it is easy to cause over-fitting (put simply, most articles end up assigned to whichever category has the most material).
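A minimal sketch of the layout (the category and file names here are only placeholders, not the original material):

./data
    sports/
        0001.txt
        0002.txt
    finance/
        0001.txt
        0002.txt
    entertainment/
        0001.txt
        0002.txt

load_files('./data') then treats each subdirectory name as a category label.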
Enough talk; straight to the code.
You need one small helper package, a wrapper around the jieba word segmenter that is imported in the trainer below: pip install jieba chinese_tokenizer
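For readers unfamiliar with jieba: CountVectorizer cannot split Chinese text on whitespace by itself, which is why a segmentation function is passed in as the tokenizer in the trainer. A quick sketch of what jieba segmentation produces (the sample sentence is only an illustration):

import jieba

# Split a Chinese sentence into a list of words
print(jieba.lcut('我爱北京天安门'))   # e.g. ['我', '爱', '北京', '天安门']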
Here's the trainer:
import json
from chinese_tokenizer.tokenizer import Tokenizer
from sklearn.datasets import load_files
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

jie_ba_tokenizer = Tokenizer().jie_ba_tokenizer

# Load the dataset: each subdirectory of ./data is a category,
# and each txt file inside it is one article
training_data = load_files('./data', encoding='utf-8')

# x_train is the article text, y_train is the category label
x_train, _, y_train, _ = train_test_split(training_data.data, training_data.target)

print('Start modelling.....')

# Save the category names so the prediction script can map label ids back to names
with open('training_data.target', 'w', encoding='utf-8') as f:
    f.write(json.dumps(training_data.target_names))

# The tokenizer parameter is a function used to segment the text
count_vect = CountVectorizer(tokenizer=jie_ba_tokenizer)
tfidf_transformer = TfidfTransformer()

X_train_counts = count_vect.fit_transform(x_train)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print('Training classifier.....')

# Train a multinomial Naive Bayes classifier
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Save the classifier (handy for reuse in other programs)
joblib.dump(clf, 'model.pkl')
# Save the vectorizer as well -- pitfall here!! The prediction script must use the
# same fitted vectorizer as the trainer, otherwise you get
# "ValueError: dimension mismatch"
joblib.dump(count_vect, 'count_vect')

print("Information about the classifier:")
print(clf)
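Note that the train_test_split call above throws away the held-out portion (the two underscores). If you want to watch for the over-fitting mentioned in step 1, a minimal sketch that keeps the split and scores the model with the already-fitted count_vect and tfidf_transformer would look like this (it continues the trainer script above):

x_train, x_test, y_train, y_test = train_test_split(training_data.data, training_data.target)

# ... fit count_vect, tfidf_transformer and clf on x_train / y_train exactly as above ...

# Transform the held-out articles with the already-fitted vectorizer and transformer
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(x_test))
print('Held-out accuracy:', clf.score(X_test_tfidf, y_test))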
The following script uses the trained classifier to classify new articles:
Articles to be classified go in the predict_data directory: again, one txt file per article.
# -*- coding: utf-8 -*-
# @Time    : 2017/8/23 18:02
# @Author  : ouch
# @Site    :
# @File    : Bayesian classifier.py
# @Software: PyCharm

import json
from sklearn.datasets import load_files
from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfTransformer

# Load the trained classifier and the vectorizer saved by the trainer
clf = joblib.load('model.pkl')
count_vect = joblib.load('count_vect')

# Each txt file in ./predict_data is one article to classify
testing_data = load_files('./predict_data', encoding='utf-8')

# Category names saved by the trainer
target_names = json.loads(open('training_data.target', 'r', encoding='utf-8').read())

# Vectorize with the *same* CountVectorizer fitted during training, then apply
# TF-IDF weighting (strictly speaking, the TF-IDF transformer fitted during
# training should be persisted and reused here as well)
tfidf_transformer = TfidfTransformer()
X_new_counts = count_vect.transform(testing_data.data)
X_new_tfidf = tfidf_transformer.fit_transform(X_new_counts)

# Predict
predicted = clf.predict(X_new_tfidf)

for title, category in zip(testing_data.filenames, predicted):
    print('%r => %s' % (title, target_names[category]))
Loading the saved vectorizer this way is what keeps the trained classifier from raising "ValueError: dimension mismatch" when it is used in a new program.
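As an aside (not part of the original workflow): a common way to avoid this whole class of mismatch problems is to bundle the vectorizer, the TF-IDF step and the classifier into a single sklearn Pipeline and persist that one object. A sketch, using the same packages as above ('pipeline.pkl' is just an illustrative file name):

from chinese_tokenizer.tokenizer import Tokenizer
from sklearn.datasets import load_files
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

training_data = load_files('./data', encoding='utf-8')

# One object holds the vectorizer, the TF-IDF weighting and the classifier,
# so the training-time and prediction-time features can never get out of sync
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=Tokenizer().jie_ba_tokenizer)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(training_data.data, training_data.target)
joblib.dump(pipeline, 'pipeline.pkl')

# Later, in another program:
pipeline = joblib.load('pipeline.pkl')
testing_data = load_files('./predict_data', encoding='utf-8')
predicted = pipeline.predict(testing_data.data)   # numeric label ids, as before
for title, category in zip(testing_data.filenames, predicted):
    print('%r => %s' % (title, category))

The predictions are still numeric label ids, so the target_names list saved in 'training_data.target' remains the way to turn them back into readable category names.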