CountVectorizer method for feature extraction in Chinese

Keywords: Python pip

CountVectorizer Method for Feature Extraction

from sklearn.feature_extraction.text import CountVectorizer

This method supports text classification through word-frequency statistics computed over segmented text.

Text feature extraction

Function: turns text into numerical features (featurization)

sklearn.feature_extraction.text.CountVectorizer(stop_words=[])

 Returns: a word-frequency matrix

CountVectorizer.fit_transform(X) X: text, or an iterable containing text strings

 Returns: a sparse matrix; call .toarray() on it to get a two-dimensional array

CountVectorizer.inverse_transform(X) X: an array or sparse matrix

 Returns: the data in its pre-transformation form, i.e. the words present in each sample

CountVectorizer.get_feature_names()

 Returns: the list of feature names (the vocabulary words)
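These calls fit together as follows; a minimal sketch on English text, which needs no segmentation. (Note: in scikit-learn >= 1.0, get_feature_names() is deprecated in favour of get_feature_names_out(), and it was removed in 1.2; the older spelling is kept here to match the demos below.)

from sklearn.feature_extraction.text import CountVectorizer

docs = ["life is short, i like python",
        "life is too long, i dislike python"]

transfer = CountVectorizer(stop_words=["is", "too"])
matrix = transfer.fit_transform(docs)        # sparse word-frequency matrix
print(matrix.toarray())                      # 2-D array of counts
print(transfer.get_feature_names())          # vocabulary as a list of words
print(transfer.inverse_transform(matrix))    # words present in each sample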



Examples of Chinese Feature Extraction (Manual Word Segmentation)

from sklearn.feature_extraction.text import CountVectorizer
# Chinese text must be segmented into words first; otherwise the whole sentence is treated as a single token. English does not need this because its words are already separated by spaces.
def chinese_text_count_demo():
    data = ["I love Tian'anmen in Beijing", "Sunrise on Tian'anmen Gate"]
    
    # 1. Instantiate a transformer class (so called because it transforms text into numerical values)
    transfer = CountVectorizer()

    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    
    return None

if __name__ == '__main__':
    chinese_text_count_demo()

//Output results:
data_new:
 [[1 1 0]
 [0 1 1]]
Feature names:
 ['北京', '天安门', '太阳']

Explanation: each row of the matrix corresponds to one sentence in data (the first row is the first sentence) and each column to one feature word: '北京' (Beijing), '天安门' (Tiananmen), '太阳' (sun). The numbers denote how many times each feature word occurs in that sentence. Single-character words such as 我 and 上 are dropped because CountVectorizer's default tokenizer only keeps tokens of two or more characters.
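To make the row/column mapping concrete, the counts can be paired with the feature names; a small illustrative sketch, reusing transfer and data_new from the demo above:

for sentence, row in zip(data, data_new.toarray()):
    print(sentence)
    for word, count in zip(transfer.get_feature_names(), row):
        print("   ", word, "->", count)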

Example of Chinese feature extraction (using jieba word segmentation)

First install jieba from the command line:

pip3 install jieba (or pip install jieba)

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_word(text):
    # Chinese word segmentation: join the segmented words with spaces
    return " ".join(list(jieba.cut(text)))
    # jieba.cut(text) returns a generator object, which is converted to a list here
    # Alternative: return " ".join(jieba.lcut(text))
    # jieba.lcut(text) returns a list directly
    # (a standalone jieba.cut / jieba.lcut sketch follows the output below)

def auto_chinese_text_count_demo():
    data = ["What do you say about this?"
           ,"Tang Long asked aloud what was the matter."
           ,"How about finding a place for a few drinks in the evening?"
           ,"Laozhong led them to Zhu Laoming and stood in front of the cemetery of Big Cypress Tree. He said, "Look at the terrain. If our people come from the city and pass through the big ferry or the small ferry, they will follow the Qianli Dike.""]
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    
    print("After the sentence participle:\n", data_new)
    
    # 1. Instantiate a transformer class
    transfer = CountVectorizer(stop_words=["say", "of"])  # stop words would normally come from a preprocessing/cleaning step; they are hard-coded here only for demonstration
    
    # 2. Call fit_transform
    data_vector_value = transfer.fit_transform(data_new)
    print("data_vector_value:\n", data_vector_value.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    
    return None
    
    
if __name__ == '__main__':
    auto_chinese_text_count_demo()


//Output results:
After word segmentation:
 ['What do you say about this?', 'Tang Long asked aloud what was the matter.', 'How about finding a place for a few drinks in the evening?', 'Laozhong led them to Zhu Laoming and stood in front of the cemetery of Big Cypress Tree. He said, "Look at the terrain. If our people come from the city and pass through the big ferry or the small ferry, they will follow the Qianli Dike."']
data_vector_value:
 [[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1]]
Feature names:
 ['they', 'Several cups', 'Qianli Dike', 'Tang Long', 'Terrain', 'place', 'Graveyard', 'In town', 'aloud', 'Cypress tree', 'Dadukou', 'What should I do?', "What's going on?", 'What about?', 'We', 'Or', 'Find a', 'From', 'Night', 'Zhu Lao Ming', 'Along', 'Ferry', 'Have a look', 'after', 'Lao Zhong Ling', 'Come here', 'this', 'Where?']
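Since cut_word hinges on the difference between jieba.cut and jieba.lcut, here is a small sketch of the two, using the first demo's sentence. The exact segmentation may vary with the jieba version and dictionary:

import jieba

text = "我爱北京天安门"  # "I love Beijing Tiananmen"

gen = jieba.cut(text)      # returns a generator of words
print(" ".join(gen))       # e.g. 我 爱 北京 天安门

words = jieba.lcut(text)   # same segmentation, returned as a list
print(words)               # e.g. ['我', '爱', '北京', '天安门']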
