Chinese word2vec training on Wikipedia based on gensim

Keywords: Python encoding xml Linux

Introduction to Word2Vec

Word2vec is a way of representing words. Unlike one-hot vectors, word2vec can express the similarity between words by computing the distance between their vectors. Word2vec captures richer features, so that words appearing in similar contexts end up as close as possible, while semantically unrelated words end up as far apart as possible. For example, [Tencent] and [Netease] will be very close, and [BMW] and [Porsche] will be very close, while Tencent and BMW/Porsche, or Netease and BMW/Porsche, will be far apart. That is because Tencent and Netease both belong to the Internet category, while BMW and Porsche belong to the automobile category. Birds of a feather flock together: the topics discussed in Internet circles are all related to the Internet, while the topics discussed in automobile circles are all related to automobiles.
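
To make "distance" concrete, here is a toy sketch (the vectors below are made-up illustrative values, not output from a trained model) of the cosine similarity commonly used to compare word vectors:

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: close to 1 means "similar"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical 3-dimensional "word vectors" (real ones have dozens of dimensions)
tencent = np.array([0.8, 0.1, 0.3])
netease = np.array([0.7, 0.2, 0.4])
bmw = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(tencent, netease))  # high: same (Internet) category
print(cosine_similarity(tencent, bmw))      # lower: different categories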

How do we obtain word2vec vectors for words? Next we will show how to use Python's gensim to obtain the word vectors we want. Generally speaking, the process includes the following steps:

  • Wiki Chinese Data Preprocessing

  • Text Data Segmentation

  • Gensim word2vec Training

Wiki Chinese Data Preprocessing

First, download the Chinese Wikipedia dump: zhwiki-latest-pages-articles.xml.bz2. Because the zhwiki data contains a lot of traditional characters and we want a simplified-character corpus, we need to do the following two things:

  • Extract the raw text from the bz2 dump using WikiCorpus in the gensim module

  • Convert traditional characters to simplified characters using OpenCC

Extracting Raw Text Data with WikiCorpus

The Python code for data processing is as follows:

from __future__ import print_function
from gensim.corpora import WikiCorpus
import jieba
import codecs
import os
import six
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import multiprocessing

 
class Config:
    data_path = 'xxx/zhwiki'
    zhwiki_bz2 = 'zhwiki-latest-pages-articles.xml.bz2'
    zhwiki_raw = 'zhwiki_raw.txt'
    zhwiki_raw_t2s = 'zhwiki_raw_t2s.txt'
    zhwiki_seg_t2s = 'zhwiki_seg.txt'
    embedded_model_t2s = 'embedding_model_t2s/zhwiki_embedding_t2s.model'
    embedded_vector_t2s = 'embedding_model_t2s/vector_t2s'
 
 
def dataprocess(_config):
    i = 0
    # use the right open() for Python 3 vs. Python 2
    if six.PY3:
        output = open(os.path.join(_config.data_path, _config.zhwiki_raw), 'w')
    else:
        output = codecs.open(os.path.join(_config.data_path, _config.zhwiki_raw), 'w')
    # WikiCorpus strips the wiki markup and yields each article as a list of tokens
    wiki = WikiCorpus(os.path.join(_config.data_path, _config.zhwiki_bz2), lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # tokens may be bytes (older gensim) or unicode strings (newer versions)
        if six.PY3:
            output.write(b' '.join(text).decode('utf-8', 'ignore') + '\n')
        else:
            output.write(' '.join(text) + '\n')
        i += 1
        if i % 10000 == 0:
            print('Saved ' + str(i) + ' articles')
    output.close()
    print('Finished: saved ' + str(i) + ' articles')

config = Config()
dataprocess(config)

Converting Traditional Characters to Simplified Characters Using OpenCC

Here, you need to install OpenCC beforehand. For how to install OpenCC in a Linux environment, please refer to this article. Converting traditional characters to simplified ones takes only two Linux commands, and it runs very quickly.

$ cd /xxx/zhwiki/
$ opencc -i zhwiki_raw.txt -o zhwiki_raw_t2s.txt -c t2s.json
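
If the OpenCC command-line tool is inconvenient to install, a Python OpenCC binding can do the same conversion. The sketch below assumes the opencc Python package is installed; depending on the package version, the configuration name may be 't2s' or 't2s.json':

from opencc import OpenCC

cc = OpenCC('t2s')  # traditional -> simplified
with open('zhwiki_raw.txt', encoding='utf-8') as fin, \
        open('zhwiki_raw_t2s.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))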

Text Data Segmentation

For the word segmentation task, we use Python's jieba module directly. You can also use LTP from Harbin Institute of Technology or Stanford's segmenter via its NLTK Python interface, which are more accurate and authoritative, but both take a long time to set up, especially the Stanford one. The jieba segmentation code is as follows:

def is_alpha(tok):
    # True if the token consists purely of ASCII letters (used to filter out non-Chinese tokens)
    try:
        return tok.encode('ascii').isalpha()
    except UnicodeEncodeError:
        return False


def zhwiki_segment(_config, remove_alpha=True):
    i = 0
    # use the right open() for Python 3 vs. Python 2
    if six.PY3:
        output = open(os.path.join(_config.data_path, _config.zhwiki_seg_t2s), 'w', encoding='utf-8')
    else:
        output = codecs.open(os.path.join(_config.data_path, _config.zhwiki_seg_t2s), 'w', encoding='utf-8')
    print('Start...')
    with codecs.open(os.path.join(_config.data_path, _config.zhwiki_raw_t2s), 'r', encoding='utf-8') as raw_input:
        for line in raw_input.readlines():
            line = line.strip()
            i += 1
            print('line ' + str(i))
            text = line.split()
            if remove_alpha:
                # drop purely alphabetic (non-Chinese) tokens
                text = [w for w in text if not is_alpha(w)]
            word_cut_seed = [jieba.cut(t) for t in text]
            tmp = ''
            for sent in word_cut_seed:
                for tok in sent:
                    tmp += tok + ' '
            tmp = tmp.strip()
            if tmp:
                output.write(tmp + '\n')
    output.close()

zhwiki_segment(config)
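
As a quick sanity check on jieba itself, the snippet below (the sentence is only an example) shows that jieba.cut returns a generator of tokens, which is why the code above joins them with spaces:

import jieba

tokens = jieba.cut(u'我爱自然语言处理')  # returns a generator of segmented words
print(' '.join(tokens))                  # e.g. 我 爱 自然语言 处理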

Gensim word2vec Training

Python's gensim module provides word2vec training, which makes training our model very convenient. For how to use gensim, you can refer to Word2Vec Practice Based on Gensim.
In this training, the word vector size is 50, the training window is 5, and the minimum word frequency is 5; multiple worker threads are used. The specific code is as follows:

def word2vec(_config, saved=False):
    print('Start...')
    model = Word2Vec(LineSentence(os.path.join(_config.data_path, _config.zhwiki_seg_t2s)),
                     size=50, window=5, min_count=5, workers=multiprocessing.cpu_count())
    if saved:
        # the embedding_model_t2s directory must already exist before saving;
        # note: newer gensim releases use vector_size=... and model.wv.save_word2vec_format
        model.save(os.path.join(_config.data_path, _config.embedded_model_t2s))
        model.save_word2vec_format(os.path.join(_config.data_path, _config.embedded_vector_t2s), binary=False)
    print("Finished!")
    return model
 
 
def wordsimilarity(word, model):
    semi = []
    try:
        # top 10 nearest neighbours by cosine similarity
        semi = model.most_similar(word, topn=10)
    except KeyError:
        print('The word is not in the vocabulary!')
    for term in semi:
        print('%s,%s' % (term[0], term[1]))

model = word2vec(config, saved=True)

The word2vec training is now complete, and we have obtained the desired model and word vectors and saved them locally. Let's look at the top 10 words closest to BMW and to Tencent, respectively. Most of the words similar to BMW belong to the automotive industry, and are themselves car brands; most of the words similar to Tencent belong to the Internet industry.

wordsimilarity(word=u'BMW', model=model)
Porsche, 0.92567974329
Goodyear, 0.888278841972
Rolls-Royce, 0.884045600891
Audi, 0.881808757782
Mazda, 0.881799697876
Yafit, 0.880708634853
Opel, 0.877104878426
Citroen, 0.876984715462
Maserati, 0.868475496769
Santana, 0.865387916565

wordsimilarity(word=u'Tencent', model=model)
Netease, 0.880213916302
Youku, 0.873666107655
Tencent, 0.87026232481
Guangzhou Daily, 0.859486758709
WeChat, 0.835543811321
Tianya Community, 0.834927380085
Robin Li, 0.832848489285
Tudou, 0.831390202045
Group Purchase, 0.829696238041
Sohu.com, 0.825544642448
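
As a final note, the saved model and the text-format vectors can be loaded back later for reuse. The sketch below is not part of the original script; the query words (宝马 / 腾讯, i.e. BMW / Tencent) are only examples, and the exact attribute names depend on the gensim version:

from gensim.models import Word2Vec, KeyedVectors

# reload the full model (which can be trained further) and the plain-text vectors
model = Word2Vec.load(os.path.join(config.data_path, config.embedded_model_t2s))
vectors = KeyedVectors.load_word2vec_format(
    os.path.join(config.data_path, config.embedded_vector_t2s), binary=False)

print(model.wv[u'腾讯'])                       # the 50-dimensional vector for "Tencent"
print(vectors.most_similar(u'宝马', topn=10))  # nearest neighbours of "BMW"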
