Simple Natural Language Processing with Python
Part of the author's Data Science and Machine Learning Handbook for Program Apes
This article is an introduction to simple natural language processing tasks with Python. All of the code in this article is available here. Recommended pre-reading: Python Syntax Sketch and Building a Machine Learning Development Environment. For more machine learning material, see the List of Recommended Books on Machine Learning, Deep Learning and Natural Language Processing, as well as the Data Science and Machine Learning Knowledge System and Data Collection for Program Apes.
20 Newsgroups Corpus Processing
The 20 Newsgroups dataset contains roughly 20,000 documents drawn from different newsgroups and was first collected and curated by Ken Lang. This part covers fetching the dataset, feature extraction, training a simple classifier, training a topic model, and so on. The section's main processing code is available as a packaged library and as an interactive Notebook demonstration. First, we need to fetch the data:
def fetch_data(self, subset='train', categories=None):
    """return data
    Perform the data-fetching operation

    Arguments:
    subset -> string -- which subset to fetch: train / test / all
    """
    rand = np.random.mtrand.RandomState(8675309)
    data = fetch_20newsgroups(subset=subset,
                              categories=categories,
                              shuffle=True,
                              random_state=rand)

    self.data[subset] = data
Then we can inspect the data format interactively in a Notebook:
# Instantiate the object
twp = TwentyNewsGroup()
# Fetch the data
twp.fetch_data()
twenty_train = twp.data['train']
print("Dataset structure", "->", twenty_train.keys())
print("Number of documents", "->", len(twenty_train.data))
print("Target categories", "->",
      [twenty_train.target_names[t] for t in twenty_train.target[:10]])

Dataset structure -> dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
Number of documents -> 11314
Target categories -> ['sci.space', 'comp.sys.mac.hardware', 'sci.electronics', 'comp.sys.mac.hardware', 'sci.space', 'rec.sport.hockey', 'talk.religion.misc', 'sci.med', 'talk.religion.misc', 'talk.politics.guns']
Next, we can extract features from the corpus:
# Feature extraction

# Build the document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print("DTM structure", "->", X_train_counts.shape)

# Look up the index of a word in the vocabulary
print("Index of a word", "->", count_vect.vocabulary_.get(u'algorithm'))

DTM structure -> (11314, 130107)
Index of a word -> 27366
To use the documents for classification tasks, we also need to convert them into feature vectors with TF-IDF or other common weighting schemes:
# Build TF feature vectors for the documents
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print("TF feature vector of a document", "->", X_train_tf)

# Build TF-IDF feature vectors for the documents
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
print("TF-IDF feature vector of a document", "->", X_train_tfidf)

TF feature vector of a document -> (0, 6447)   0.0380693493813
                                   (0, 37842)  0.0380693493813
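For reference, with its default settings (smooth_idf=True, followed by L2 normalization of each document vector), scikit-learn's TfidfTransformer computes approximately

    tfidf(t, d) = tf(t, d) * ( ln( (1 + n) / (1 + df(t)) ) + 1 )

where n is the number of documents and df(t) is the number of documents containing term t; the plain TF variant above simply omits the idf factor (use_idf=False).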
We can encapsulate feature extraction, classifier training, and prediction as separate methods:
def extract_feature(self):
    """
    Extract document features from the corpus
    """
    # Document-term matrix of the training data
    self.train_dtm = self.count_vect.fit_transform(self.data['train'].data)

    # TF features of the documents
    tf_transformer = TfidfTransformer(use_idf=False).fit(self.train_dtm)
    self.train_tf = tf_transformer.transform(self.train_dtm)

    # TF-IDF features of the documents
    tfidf_transformer = TfidfTransformer().fit(self.train_dtm)
    self.train_tfidf = tfidf_transformer.transform(self.train_dtm)

def train_classifier(self):
    """
    Train the classifier on the training set
    """
    self.extract_feature()

    self.clf = MultinomialNB().fit(
        self.train_tfidf, self.data['train'].target)

def predict(self, docs):
    """
    Predict categories for new documents
    """
    X_new_counts = self.count_vect.transform(docs)

    tfidf_transformer = TfidfTransformer().fit(X_new_counts)
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)

    return self.clf.predict(X_new_tfidf)
Then we train the classifier and run prediction and evaluation:
# Train the classifier
twp.train_classifier()

# Run predictions
docs_new = ['God is love', 'OpenGL on the GPU is fast']
predicted = twp.predict(docs_new)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

# Evaluate the model
twp.fetch_data(subset='test')
predicted = twp.predict(twp.data['test'].data)

import numpy as np

# Simple accuracy (mean of correct predictions)
np.mean(predicted == twp.data['test'].target)

# Metrics
from sklearn import metrics
print(metrics.classification_report(
    twp.data['test'].target, predicted,
    target_names=twp.data['test'].target_names))

# Confusion matrix
metrics.confusion_matrix(twp.data['test'].target, predicted)

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => rec.autos

                     precision    recall  f1-score   support
        alt.atheism       0.79      0.50      0.61       319
                ...
 talk.religion.misc       1.00      0.08      0.15       251
        avg / total       0.82      0.79      0.77      7532

Out[16]:
array([[158,   0,   1,   1,   0,   1,   0,   3,   7,   1,   2,   6,   1,   8,   3, 114,   6,   7,   0,   0],
       ...
       [ 35,   3,   1,   0,   0,   0,   1,   4,   1,   1,   6,   3,   0,   6,   5, 127,  30,   5,   2,  21]])
We can also extract topics from document sets:
# Topic extraction
twp.topics_by_lda()

Topic 0 : stream s1 astronaut zoo laurentian maynard s2 gtoal pem fpu
Topic 1 : 145 cx 0d bh sl 75u 6um m6 sy gld
Topic 2 : apartment wpi mars nazis monash palestine ottoman sas winner gerard
Topic 3 : livesey contest satellite tamu mathew orbital wpd marriage solntze pope
Topic 4 : x11 contest lib font string contrib visual xterm ahl brake
Topic 5 : ax g9v b8f a86 1d9 pl 0t wm 34u giz
Topic 6 : printf null char manes behanna senate handgun civilians homicides magpie
Topic 7 : buf jpeg chi tor bos det que uwo pit blah
Topic 8 : oracle di t4 risc nist instruction msg postscript dma convex
Topic 9 : candida cray yeast viking dog venus bloom symptoms observatory roby
Topic 10 : cx ck hz lk mv cramer adl optilink k8 uw
Topic 11 : ripem rsa sandvik w0 bosnia psuvm hudson utk defensive veal
Topic 12 : db espn sabbath br widgets liar davidian urartu sdpa cooling
Topic 13 : ripem dyer ucsu carleton adaptec tires chem alchemy lockheed rsa
Topic 14 : ingr sv alomar jupiter borland het intergraph factory paradox captain
Topic 15 : militia palestinian cpr pts handheld sharks igc apc jake lehigh
Topic 16 : alaska duke col russia uoknor aurora princeton nsmca gene stereo
Topic 17 : uuencode msg helmet eos satan dseg homosexual ics gear pyron
Topic 18 : entries myers x11r4 radar remark cipher maine hamburg senior bontchev
Topic 19 : cubs ufl vitamin temple gsfc mccall astro bellcore uranium wesleyan
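The topics_by_lda method called above is implemented in the packaged library; as a hedged sketch of the underlying idea (the parameter names and defaults here are assumptions), one way to bridge the scikit-learn document-term matrix built in extract_feature over to gensim looks like this:

from gensim import matutils
from gensim.models.ldamulticore import LdaMulticore

def topics_by_lda(self, num_topics=20, num_words=10):
    """
    Sketch: train an LDA topic model on the training DTM and print its topics
    """
    # Vocabulary of the CountVectorizer fitted in extract_feature
    vocab = self.count_vect.get_feature_names()

    # Convert the sparse scipy matrix into a gensim corpus (documents as rows)
    corpus = matutils.Sparse2Corpus(self.train_dtm, documents_columns=False)

    lda = LdaMulticore(
        corpus,
        num_topics=num_topics,
        id2word=dict(enumerate(vocab)),
        workers=4)

    # Print the top words of each topic
    topics = lda.show_topics(num_topics=num_topics, num_words=num_words,
                             formatted=False, log=False)
    for ti, topic in enumerate(topics):
        print("Topic", ti, ":", " ".join(word for word, _ in topic[1]))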
Packaging Common Natural Language Processing Tools
From the 20 Newsgroups walkthrough above, we can see that common natural language processing tasks include data acquisition, data preprocessing, feature extraction, classifier training, and topic model or word vector extraction. The author is also in the habit of using python-fire to quickly wrap classes into tools that can be invoked from the command line while still being importable as ordinary modules (a sketch of such a wrapper follows the Wiki class below). In this section we mainly take Chinese corpora as an example; for instance, to analyze Chinese Wikipedia data we can use gensim's WikiCorpus in a Wikipedia processing class:
class Wiki(object):
    """
    Wikipedia corpus processing
    """

    def wiki2texts(self, wiki_data_path, wiki_texts_path='./wiki_texts.txt'):
        """
        Convert the Wikipedia dump into plain text data

        Arguments:
        wiki_data_path -- path to the Wikipedia dump archive
        """
        if not wiki_data_path:
            print("Please enter the path to the Wiki dump, or download it from https://dumps.wikimedia.org/zhwiki/")
            exit()

        # Build the Wikipedia corpus
        wiki_corpus = WikiCorpus(wiki_data_path, dictionary={})
        texts_num = 0

        with open(wiki_texts_path, 'w', encoding='utf-8') as output:
            for text in wiki_corpus.get_texts():
                output.write(b' '.join(text).decode('utf-8') + '\n')
                texts_num += 1
                if texts_num % 10000 == 0:
                    logging.info("Processed %d articles" % texts_num)

        print("Processing finished. Please use OpenCC to convert to Simplified Chinese characters.")
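As mentioned above, python-fire can expose such a class on the command line with just a couple of lines appended to the defining file; the following is a minimal sketch (the entry-point file name and the example dump path are assumptions):

import fire

if __name__ == '__main__':
    # Expose the public methods of Wiki as CLI subcommands, e.g.:
    #   python chinese_text_processor.py wiki2texts ./zhwiki-latest-pages-articles.xml.bz2
    fire.Fire(Wiki)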
After the dump has been converted to text, we still need to use OpenCC to convert it to Simplified Chinese characters. We can then segment the generated text file with jieba word segmentation (reference code here); we run this step directly with python chinese_text_processor.py tokenize_file /output.txt, which produces the segmented output file. A minimal sketch of such a tokenizer is shown below; once we have segmented documents, the CorpusProcessor class after it converts them into a simple bag-of-words representation or a document-term matrix (full code here).
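The tokenize_file step invoked above is implemented in the packaged library; the following is only a sketch of the idea, assuming jieba as the segmenter and space-separated output (the method signature and default output path are assumptions, not the project's actual interface):

import jieba

def tokenize_file(self, text_path, output_path='./tokenized.txt'):
    """
    Sketch: segment each line of a Chinese text file with jieba
    and write the space-separated tokens to the output file
    """
    with open(text_path, 'r', encoding='utf-8') as source, \
         open(output_path, 'w', encoding='utf-8') as output:
        for line in source:
            words = jieba.cut(line.strip())
            output.write(' '.join(words) + '\n')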
class CorpusProcessor:
    """
    Corpus processing
    """

    def corpus2bow(self, tokenized_corpus=default_documents):
        """returns (vocab, corpus_in_bow)
        Convert a corpus into bag-of-words (BOW) form

        Arguments:
        tokenized_corpus -- a list of tokenized documents

        Return:
        vocab -- {'human': 0, ... 'minors': 11}
        corpus_in_bow -- [[(0, 1), (1, 1), (2, 1)]...]
        """
        dictionary = corpora.Dictionary(tokenized_corpus)

        # Get the vocabulary
        vocab = dictionary.token2id

        # Get the bag-of-words representation of the documents
        corpus_in_bow = [dictionary.doc2bow(text) for text in tokenized_corpus]

        return (vocab, corpus_in_bow)

    def corpus2dtm(self, tokenized_corpus=default_documents, min_df=10, max_df=100):
        """returns (vocab, DTM)
        Convert a corpus into a document-term matrix

        dtm -> matrix: document-term matrix
               I   like   hate   databases
        D1     1      1      0           1
        D2     1      0      1           1
        """

        if type(tokenized_corpus[0]) is list:
            documents = [" ".join(document) for document in tokenized_corpus]
        else:
            documents = tokenized_corpus

        if max_df == -1:
            max_df = round(len(documents) / 2)

        # Build the counting vectorizer for the corpus
        vec = CountVectorizer(min_df=min_df,
                              max_df=max_df,
                              analyzer="word",
                              token_pattern="[\S]+",
                              tokenizer=None,
                              preprocessor=None,
                              stop_words=None
                              )

        # Fit and transform the data
        DTM = vec.fit_transform(documents)

        # Get the vocabulary
        vocab = vec.get_feature_names()

        return (vocab, DTM)
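As a quick, hedged usage check (the toy documents below are invented for illustration, and the exact ids and ordering depend on gensim and scikit-learn internals), the two methods can be exercised interactively:

cp = CorpusProcessor()

tokenized_docs = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
]

# Bag-of-words via gensim's Dictionary
vocab, bow = cp.corpus2bow(tokenized_docs)
print(vocab)  # e.g. {'computer': ..., 'human': ..., ...}
print(bow)    # e.g. [[(0, 1), (1, 1), (2, 1)], ...]

# Document-term matrix via scikit-learn's CountVectorizer
vocab, dtm = cp.corpus2dtm(tokenized_docs, min_df=1, max_df=2)
print(vocab)
print(dtm.toarray())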
We can also extract topic models or word vectors from the segmented documents; because we work with already-segmented text, the differences between Chinese and English can be ignored at this stage:
def topics_by_lda(self, tokenized_corpus_path, num_topics=20,
                  num_words=10, max_lines=10000, split="\s+", max_df=100):
    """
    Read a tokenized corpus file and train an LDA model on it

    Arguments:
    tokenized_corpus_path -> string -- path to the tokenized corpus
    num_topics -> integer -- number of topics
    num_words -> integer -- number of words to show per topic
    max_lines -> integer -- maximum number of lines to read
    split -> string -- separator between words within a document
    max_df -> integer -- filter out overly common words above this threshold
    """

    # Store the whole corpus
    corpus = []

    with open(tokenized_corpus_path, 'r', encoding='utf-8') as tokenized_corpus:
        flag = 0
        for document in tokenized_corpus:
            # Stop once enough lines have been read
            if flag > max_lines:
                break

            # Add the current document to the corpus
            corpus.append(re.split(split, document))
            flag = flag + 1

    # Build the BOW (DTM) representation of the corpus
    (vocab, DTM) = self.corpus2dtm(corpus, max_df=max_df)

    # Train the LDA model
    lda = LdaMulticore(
        matutils.Sparse2Corpus(DTM, documents_columns=False),
        num_topics=num_topics,
        id2word=dict([(i, s) for i, s in enumerate(vocab)]),
        workers=4
    )

    # Print and return the topic data
    topics = lda.show_topics(
        num_topics=num_topics,
        num_words=num_words,
        formatted=False,
        log=False)

    for ti, topic in enumerate(topics):
        print("Topic", ti, ":", " ".join(word[0] for word in topic[1]))
This method can also be invoked directly from the command line, passing in the tokenized file. We can likewise build word vectors for the corpus (see here for the code); if you are not familiar with the basics of word vectors, refer to Word2Vec Practice Based on Gensim:
def wv_train(self, tokenized_text_path, output_model_path='./wv_model.bin'):
    """
    Train word vectors on the text and save the resulting model
    """
    sentences = word2vec.Text8Corpus(tokenized_text_path)

    # Train the model
    model = word2vec.Word2Vec(sentences, size=250)

    # Save the model
    model.save(output_model_path)

def wv_visualize(self, model_path, word=["China", "aviation"]):
    """
    Find the words nearest to the input words and visualize them

    Parameters:
        model_path: path to the Word2Vec model
    """

    # Load the model
    model = word2vec.Word2Vec.load(model_path)

    # Find the most similar words
    words = [wp[0] for wp in model.most_similar(word, topn=20)]

    # Extract the word vectors for those words
    wordsInVector = [model[word] for word in words]

    # PCA dimensionality reduction
    pca = PCA(n_components=2)
    pca.fit(wordsInVector)
    X = pca.transform(wordsInVector)

    # Draw the figure
    xs = X[:, 0]
    ys = X[:, 1]

    plt.figure(figsize=(12, 8))
    plt.scatter(xs, ys, marker='o')

    # Iterate over all words and add annotations
    for i, w in enumerate(words):
        plt.annotate(
            w,
            xy=(xs[i], ys[i]), xytext=(6, 6),
            textcoords='offset points', ha='left', va='top',
            **dict(fontsize=10)
        )

    plt.show()
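As a closing usage note, and assuming these two methods live on the CorpusProcessor class wrapped with python-fire earlier (the paths below are placeholders):

# From the command line via the python-fire wrapper
#   python chinese_text_processor.py wv_train ./tokenized_wiki.txt ./wv_model.bin
#   python chinese_text_processor.py wv_visualize ./wv_model.bin

# Or from an ordinary Python session
cp = CorpusProcessor()
cp.wv_train('./tokenized_wiki.txt', output_model_path='./wv_model.bin')
cp.wv_visualize('./wv_model.bin', word=["China", "aviation"])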