CountVectorizer Method for Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer
This method turns text into features by counting word frequencies after word segmentation.
Text feature extraction
Function: featurization of text (converting text into numerical features)
sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
    Returns: a word-frequency matrix
CountVectorizer.fit_transform(X)
    X: text, or an iterable containing text strings
    Returns: a sparse matrix, which can be converted to a two-dimensional array by appending .toarray()
CountVectorizer.inverse_transform(X)
    X: array or sparse matrix
    Returns: the data in its pre-conversion form (the feature words present in each document)
CountVectorizer.get_feature_names()
    Returns: a list of words, i.e. the feature names
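To make these calls concrete, here is a minimal sketch on English text (no segmentation needed); the two sentences and the stop-word list are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

def english_count_demo():
    # Hypothetical example sentences, for illustration only
    data = ["life is short, i like python",
            "life is too long, i dislike python"]
    # stop_words removes the listed words from the vocabulary
    transfer = CountVectorizer(stop_words=["is", "too"])
    # fit_transform returns a sparse matrix
    data_new = transfer.fit_transform(data)
    print("frequency matrix:\n", data_new.toarray())
    # Feature names are sorted alphabetically:
    # ['dislike', 'life', 'like', 'long', 'python', 'short']
    print("feature names:\n", transfer.get_feature_names())

if __name__ == '__main__':
    english_count_demo()

Note that single-character tokens such as "i" are dropped by CountVectorizer's default token pattern, which only keeps tokens of two or more characters.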
Example of Chinese Feature Extraction (Manual Word Segmentation)
from sklearn.feature_extraction.text import CountVectorizer

# Chinese text needs word segmentation first; otherwise the whole sentence
# is treated as a single word. English does not need this, because English
# words are already separated by spaces.
def chinese_text_count_demo():
    data = ["I love Tian'anmen in Beijing", "Sunrise on Tian'anmen Gate"]
    # 1. Instantiate a converter class (it is called a converter because it
    #    converts text into numerical values)
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

if __name__ == '__main__':
    chinese_text_count_demo()

//Output results:
data_new:
 [[1 1 0]
 [0 1 1]]
Feature names:
 ['Beijing', 'Tiananmen', 'sunlight']

Analysis: each row of data_new corresponds to one sentence in data (the first row is the first sentence), and each number denotes how many times that feature word occurs in the sentence.
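The inverse_transform call from the API reference above is not exercised by this example. A minimal sketch of it, reusing the same two sentences, might look like this (only the set of feature words present in each document is recovered, not word order or counts):

from sklearn.feature_extraction.text import CountVectorizer

data = ["I love Tian'anmen in Beijing", "Sunrise on Tian'anmen Gate"]
transfer = CountVectorizer()
data_new = transfer.fit_transform(data)
# inverse_transform returns, for each row of the matrix, the array of
# feature words whose count in that row is non-zero.
print(transfer.inverse_transform(data_new))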
Example of Chinese Feature Extraction (Using jieba Word Segmentation)
First, install jieba from the command line:
pip3 install jieba / pip install jieba
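To verify the installation and see what jieba produces, a quick sketch (the sample sentence is a made-up illustration):

import jieba

# jieba.cut returns a generator of tokens; join them with spaces so that
# CountVectorizer can later treat each token as a separate word.
sentence = "我爱北京天安门"            # "I love Tian'anmen in Beijing"
print(" ".join(jieba.cut(sentence)))   # 我 爱 北京 天安门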
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_word(text):
    # Chinese word segmentation.
    # jieba.cut(text) returns a generator object, so convert it to a list first.
    return " ".join(list(jieba.cut(text)))
    # Alternatively: return " ".join(jieba.lcut(text))
    # jieba.lcut(text) returns a list directly.

def auto_chinese_text_count_demo():
    data = ["What do you say about this?",
            "Tang Long asked aloud what was the matter.",
            "How about finding a place for a few drinks in the evening?",
            'Laozhong led them to Zhu Laoming and stood in front of the cemetery '
            'of the big cypress tree. He said, "Look at the terrain. If our people '
            'come from the city and pass through the big ferry or the small ferry, '
            'they will follow the Qianli Dike."']
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    print("Sentences after word segmentation:\n", data_new)
    # 1. Instantiate a converter class.
    #    Stop words should really be cleaned up during pre-processing;
    #    this is just a demonstration.
    transfer = CountVectorizer(stop_words=["say", "of"])
    # 2. Call fit_transform
    data_vector_value = transfer.fit_transform(data_new)
    print("data_vector_value:\n", data_vector_value.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

if __name__ == '__main__':
    auto_chinese_text_count_demo()

//Output results:
Sentences after word segmentation:
 ['What do you say about this?', 'Tang Long asked aloud what was the matter.', 'How about finding a place for a few drinks in the evening?', 'Laozhong led them to Zhu Laoming and stood in front of the cemetery of the big cypress tree. He said, "Look at the terrain. If our people come from the city and pass through the big ferry or the small ferry, they will follow the Qianli Dike."']
data_vector_value:
 [[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1]]
Feature names:
 ['they', 'a few drinks', 'Qianli', 'Tang Long', 'terrain', 'place', 'cemetery', 'in town', 'aloud', 'cypress tree', 'big ferry', 'what should I do', "what's going on", 'how about', 'we', 'or', 'find a', 'from', 'evening', 'Zhu Laoming', 'along', 'ferry', 'have a look', 'after', 'Laozhong led', 'come here', 'this', 'where']
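One version caveat: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2. On recent versions, the examples above need get_feature_names_out() instead; everything else stays the same. A minimal sketch with made-up input:

from sklearn.feature_extraction.text import CountVectorizer

transfer = CountVectorizer()
transfer.fit_transform(["one two two", "two three"])
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0;
# it returns an array of feature names instead of a list.
print(transfer.get_feature_names_out())  # ['one' 'three' 'two']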