Python-based Practice of Simple Natural Language Processing

Keywords: Python Mac encoding Oracle

Python-based Practice of Simple Natural Language Processing Subordinate to the author Data Science and Machine Learning Manual for Programmed Apes.

Simple Natural Language Processing Based on Python

This article is an introduction to simple natural language processing tasks based on Python. All the code in this article is placed in Here . Recommended Pre-reading Python Syntax Sketch and Construction of Machine Learning Development Environment More Machine Learning Materials Reference List of Recommended Books in Machine Learning, Deep Learning and Natural Language Processing as well as Data Science and Machine Learning Knowledge System and Data Collection Oriented to Program Apes.

Twenty News Group Corpus Processing

The 20 Newsgroup dataset contains about 20,000 documents from different newsgroups, which were first collected and collated by Ken Lang. This part includes data set capture, feature extraction, simple classifier training, topic model training and so on. This section contains the main processing code Packaging Library and Notebook-based interactive demonstration . First of all, we need to capture data:

    def fetch_data(self, subset='train', categories=None):
        """return data
        //Perform data grabbing operations
        Arguments:
        subset -> string -- Target set captured train / test / all
        """
        rand = np.random.mtrand.RandomState(8675309)
        data = fetch_20newsgroups(subset=subset,
                                  categories=categories,
                                  shuffle=True,
                                  random_state=rand)

        self.data[subset] = data

Then the data format is viewed interactively in Notebook:

# Instance object
twp = TwentyNewsGroup()
# Grabbing data
twp.fetch_data()
twenty_train = twp.data['train']
print("Data Set Structure", "->", twenty_train.keys())
print("Number of documents", "->", len(twenty_train.data))
print("target classification", "->",[ twenty_train.target_names[t] for t in twenty_train.target[:10]])

//Data Set Structure - > dict_keys (['data','filenames','target_names','target','DESCR','description'])
//Number of documents - > 11314
//Target classification - > ['sci.space','comp.sys.mac.hardware','sci.electronics','comp.sys.mac.hardware','sci.space','rec.sport.hockey','talk.religion.misc','sci med','talk.religion.misc','talk.politics.guns']

Next, we can extract the features from the corpus.

# Feature extraction

# Building Document-Term Matrix

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(twenty_train.data)

print("DTM structure","->",X_train_counts.shape)

# Look at the subscript of a word in the vocabulary
print("Word corresponding subscript","->", count_vect.vocabulary_.get(u'algorithm'))

DTM structure -> (11314, 130107)
//Word corresponding subscript - > 27366

In order to use documents for classification tasks, TF-IDF and other common methods are also needed to convert documents into feature vectors:

# Constructing TF Feature Vectors of Documents
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

print("A document TF feature vector","->",X_train_tf)

# TF-IDF Feature Vector for Document Construction
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tf_transformer.transform(X_train_counts)

print("A document TF-IDF feature vector","->",X_train_tfidf)

//TF eigenvector of a document - > (0,6447) 0.0380693493813
  (0, 37842)    0.0380693493813

We can encapsulate feature extraction, classifier training and prediction as separate functions:

    def extract_feature(self):
        """
        //Extracting document features from corpus
        """

        # Document-word matrix for acquiring training data
        self.train_dtm = self.count_vect.fit_transform(self.data['train'].data)

        # Getting TF features of documents

        tf_transformer = TfidfTransformer(use_idf=False)

        self.train_tf = tf_transformer.transform(self.train_dtm)

        # Getting TF-IDF Features of Documents

        tfidf_transformer = TfidfTransformer().fit(self.train_dtm)

        self.train_tfidf = tf_transformer.transform(self.train_dtm)

    def train_classifier(self):
        """
        //Classifier training from training set
        """

        self.extract_feature();

        self.clf = MultinomialNB().fit(
            self.train_tfidf, self.data['train'].target)

    def predict(self, docs):
        """
        //Classifier training from training set
        """

        X_new_counts = self.count_vect.transform(docs)

        tfidf_transformer = TfidfTransformer().fit(X_new_counts)
        
        X_new_tfidf = tfidf_transformer.transform(X_new_counts)

        return self.clf.predict(X_new_tfidf)

Then the training is carried out and the prediction and evaluation are carried out.

# Training classifier
twp.train_classifier()

# Execution Forecast
docs_new = ['God is love', 'OpenGL on the GPU is fast']
predicted = twp.predict(docs_new)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
    
# Execution Model Assessment
twp.fetch_data(subset='test')

predicted = twp.predict(twp.data['test'].data)

import numpy as np

# Error calculation

# Simple Error Mean
np.mean(predicted == twp.data['test'].target)   

# Metrics

from sklearn import metrics

print(metrics.classification_report(
    twp.data['test'].target, predicted,
    target_names=twp.data['test'].target_names))

# Confusion Matrix
metrics.confusion_matrix(twp.data['test'].target, predicted)

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => rec.autos
                          precision    recall  f1-score   support

             alt.atheism       0.79      0.50      0.61       319
           ...
      talk.religion.misc       1.00      0.08      0.15       251

             avg / total       0.82      0.79      0.77      7532

Out[16]:
array([[158,   0,   1,   1,   0,   1,   0,   3,   7,   1,   2,   6,   1,
          8,   3, 114,   6,   7,   0,   0],
       ...
       [ 35,   3,   1,   0,   0,   0,   1,   4,   1,   1,   6,   3,   0,
          6,   5, 127,  30,   5,   2,  21]])

We can also extract topics from document sets:

# Subject extraction

twp.topics_by_lda()

Topic 0 : stream s1 astronaut zoo laurentian maynard s2 gtoal pem fpu
Topic 1 : 145 cx 0d bh sl 75u 6um m6 sy gld
Topic 2 : apartment wpi mars nazis monash palestine ottoman sas winner gerard
Topic 3 : livesey contest satellite tamu mathew orbital wpd marriage solntze pope
Topic 4 : x11 contest lib font string contrib visual xterm ahl brake
Topic 5 : ax g9v b8f a86 1d9 pl 0t wm 34u giz
Topic 6 : printf null char manes behanna senate handgun civilians homicides magpie
Topic 7 : buf jpeg chi tor bos det que uwo pit blah
Topic 8 : oracle di t4 risc nist instruction msg postscript dma convex
Topic 9 : candida cray yeast viking dog venus bloom symptoms observatory roby
Topic 10 : cx ck hz lk mv cramer adl optilink k8 uw
Topic 11 : ripem rsa sandvik w0 bosnia psuvm hudson utk defensive veal
Topic 12 : db espn sabbath br widgets liar davidian urartu sdpa cooling
Topic 13 : ripem dyer ucsu carleton adaptec tires chem alchemy lockheed rsa
Topic 14 : ingr sv alomar jupiter borland het intergraph factory paradox captain
Topic 15 : militia palestinian cpr pts handheld sharks igc apc jake lehigh
Topic 16 : alaska duke col russia uoknor aurora princeton nsmca gene stereo
Topic 17 : uuencode msg helmet eos satan dseg homosexual ics gear pyron
Topic 18 : entries myers x11r4 radar remark cipher maine hamburg senior bontchev
Topic 19 : cubs ufl vitamin temple gsfc mccall astro bellcore uranium wesleyan

Common Natural Language Processing Tool Packaging

After the introduction of 20NewsGroup corpus processing above, we can find that common tasks of natural language processing include data acquisition, data preprocessing, data feature extraction, classification model training, topic model or word vector extraction and so on. The author is also used to it. python-fire Fast encapsulation of classes is a tool that can be invoked through the command line, while supporting the use of external module calls. In this section, we mainly take the Chinese corpus as an example. For example, we need to analyze the Chinese Wikipedia data. We can use gensim. Wikipedia Processing Class:

class Wiki(object):
    """
    //Wikipedia Corpus Processing
    """
    
    def wiki2texts(self, wiki_data_path, wiki_texts_path='./wiki_texts.txt'):
        """
        //Converting Wikipedia Data into Text Data
        Arguments:
        wiki_data_path -- Wikipedia Compressed File Address
        """
        if not wiki_data_path:
            print("Please enter Wiki Compressed file path or destination https://dumps.wikimedia.org/zhwiki/Download")
            exit()

        # Building Wikipedia Corpus
        wiki_corpus = WikiCorpus(wiki_data_path, dictionary={})
        texts_num = 0

        with open(wiki_text_path, 'w', encoding='utf-8') as output:
            for text in wiki_corpus.get_texts():
                output.write(b' '.join(text).decode('utf-8') + '\n')
                texts_num += 1
                if texts_num % 10000 == 0:
                    logging.info("Processed %d Articles" % texts_num)

        print("After processing, please use OpenCC Converting to Simplified Characters")

After grabbing, we also need to convert OpenCC into simplified characters. After grabbing, we can use the stuttering participle to segment the generated text file, code reference. Here We execute this task directly using Python chinese_text_processor.py tokenize_file/output.txt and generate the output file. After getting the document after word segmentation, we can translate it into simple word bag representation or document-word vector for detailed code reference. Here:

class CorpusProcessor:
    """
    //Corpus Processing
    """

    def corpus2bow(self, tokenized_corpus=default_documents):
        """returns (vocab,corpus_in_bow)
        //Converting Corpus into BOW Form
        Arguments:
        tokenized_corpus -- A list of participled documents
        Return:
        vocab -- {'human': 0, ... 'minors': 11}
        corpus_in_bow -- [[(0, 1), (1, 1), (2, 1)]...]
        """
        dictionary = corpora.Dictionary(tokenized_corpus)

        # Get a glossary
        vocab = dictionary.token2id

        # Wordbag Representation for Getting Documents
        corpus_in_bow = [dictionary.doc2bow(text) for text in tokenized_corpus]

        return (vocab, corpus_in_bow)

    def corpus2dtm(self, tokenized_corpus=default_documents, min_df=10, max_df=100):
        """returns (vocab, DTM)
        //Converting Corpus into Document-Word Matrix
        - dtm -> matrix: File-Word Matrix
                I    like    hate    databases
        D1    1      1          0            1
        D2    1      0          1            1
        """

        if type(tokenized_corpus[0]) is list:
            documents = [" ".join(document) for document in tokenized_corpus]
        else:
            documents = tokenized_corpus

        if max_df == -1:
            max_df = round(len(documents) / 2)

        # Constructing Statistical Vector of Corpus
        vec = CountVectorizer(min_df=min_df,
                              max_df=max_df,
                              analyzer="word",
                              token_pattern="[\S]+",
                              tokenizer=None,
                              preprocessor=None,
                              stop_words=None
                              )

        # Data analysis
        DTM = vec.fit_transform(documents)

        # Get a glossary
        vocab = vec.get_feature_names()

        return (vocab, DTM)

We can also extract topic models or word vectors from documents after word segmentation, where the differences between Chinese and English can be ignored by using the documents after word segmentation.

    def topics_by_lda(self, tokenized_corpus_path, num_topics=20, num_words=10, max_lines=10000, split="\s+", max_df=100):
        """
        //Read the participled files and train them in LDA
        Arguments:
        tokenized_corpus_path -> string -- Corpus address after word segmentation
        num_topics -> integer -- Number of topics
        num_words -> integer -- Number of subject words
        max_lines -> integer -- Maximum number of rows per read
        split -> string -- A separator between words in a document
        max_df -> integer -- Avoid common words and filter words that exceed this threshold
        """

        # Store all corpus information
        corpus = []

        with open(tokenized_corpus_path, 'r', encoding='utf-8') as tokenized_corpus:

            flag = 0

            for document in tokenized_corpus:

                # Determine whether enough rows have been read
                if(flag > max_lines):
                    break

                # Add the read content to the corpus
                corpus.append(re.split(split, document))

                flag = flag + 1

        # BOW Representation for Constructing Corpus
        (vocab, DTM) = self.corpus2dtm(corpus, max_df=max_df)

        # Training LDA Model

        lda = LdaMulticore(
            matutils.Sparse2Corpus(DTM, documents_columns=False),
            num_topics=num_topics,
            id2word=dict([(i, s) for i, s in enumerate(vocab)]),
            workers=4
        )

        # Print and return subject data
        topics = lda.show_topics(
            num_topics=num_topics,
            num_words=num_words,
            formatted=False,
            log=False)

        for ti, topic in enumerate(topics):
            print("Topic", ti, ":", " ".join(word[0] for word in topic[1]))

The function can also be invoked directly from the command line and passed into the file after the participle. We can also build word vectors and code references for their corpus. Here If you are not familiar with the basic use of word vectors, you can refer to them. Word2Vec Practice Based on Gensim:

    def wv_train(self, tokenized_text_path, output_model_path='./wv_model.bin'):
        """
        //Word vector training is performed for text and the output word vector is saved.
        """

        sentences = word2vec.Text8Corpus(tokenized_text_path)

        # Model training
        model = word2vec.Word2Vec(sentences, size=250)

        # Save the model
        model.save(output_model_path)

    def wv_visualize(self, model_path, word=["China", "aviation"]):
        """
        //Search for adjacent words according to the input words and then visualize them.
        //Parameters:
            model_path: Word2Vec Model address
        """

        # Loading model
        model = word2vec.Word2Vec.load(model_path)

        # Find out the most similar words
        words = [wp[0] for wp in model.most_similar(word, topn=20)]

        # Extraction of word vectors corresponding to words
        wordsInVector = [model[word] for word in words]

        # PCA dimensionality reduction
        pca = PCA(n_components=2)
        pca.fit(wordsInVector)
        X = pca.transform(wordsInVector)

        # Drawing Graphics
        xs = X[:, 0]
        ys = X[:, 1]

        plt.figure(figsize=(12, 8))
        plt.scatter(xs, ys, marker='o')

        # Traverse through all words to add annotations
        for i, w in enumerate(words):
            plt.annotate(
                w,
                xy=(xs[i], ys[i]), xytext=(6, 6),
                textcoords='offset points', ha='left', va='top',
                **dict(fontsize=10)
            )
        plt.show()

Posted by kykin on Thu, 11 Jul 2019 13:59:54 -0700