Introduction to Word2Vec
Word2Vec is a way of representing words as vectors. Unlike one-hot vectors, word2vec can express the similarity between words by measuring the distance between their vectors. Word2vec captures richer features, pulling words that share the same context close together while pushing semantically unrelated words far apart. For example, [Tencent] and [Netease] will be very close, and [BMW] and [Porsche] will be very close, while Tencent and BMW/Porsche, or Netease and BMW/Porsche, will be far apart. This is because Tencent and Netease both belong to the Internet category, while BMW and Porsche belong to the automobile category. Birds of a feather flock together: the topics discussed in Internet circles are all related to the Internet, while the topics discussed in automobile circles are all related to automobiles.
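To make the idea of "distance between words" concrete, here is a minimal sketch using numpy; the three-dimensional vectors are toy values invented for illustration, not real word2vec output:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1 for similar directions, close to 0 for unrelated ones.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors, invented for illustration only.
tencent = np.array([0.9, 0.1, 0.2])
netease = np.array([0.8, 0.2, 0.1])
bmw = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(tencent, netease))  # high: both "Internet" words
print(cosine_similarity(tencent, bmw))      # lower: different domains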
How do we obtain word2vec vectors? Next we will show how to use Python's gensim to get the word vectors we want. Generally speaking, the process includes the following steps:
wiki Chinese Data Preprocessing
Text Data Segmentation
Gensim Word2Vec Training
wiki Chinese Data Preprocessing
First, download the wiki Chinese data: zhwiki-latest-pages-articles.xml.bz2. Because the zhwiki data contains a lot of traditional characters and we want a simplified-character corpus, we need to do the following two things:
Getting the original text data from the bz2 file using WikiCorpus in the gensim module
Converting Traditional Characters to Simplified Characters Using OpenCC
Getting the Original Text Data with WikiCorpus
The Python code for data processing is as follows:
from __future__ import print_function
from gensim.corpora import WikiCorpus
import jieba
import codecs
import os
import six
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import multiprocessing


class Config:
    data_path = 'xxx/zhwiki'
    zhwiki_bz2 = 'zhwiki-latest-pages-articles.xml.bz2'
    zhwiki_raw = 'zhwiki_raw.txt'
    zhwiki_raw_t2s = 'zhwiki_raw_t2s.txt'
    zhwiki_seg_t2s = 'zhwiki_seg.txt'
    embedded_model_t2s = 'embedding_model_t2s/zhwiki_embedding_t2s.model'
    embedded_vector_t2s = 'embedding_model_t2s/vector_t2s'


def dataprocess(_config):
    i = 0
    if six.PY3:
        output = open(os.path.join(_config.data_path, _config.zhwiki_raw), 'w')
    else:
        output = codecs.open(os.path.join(_config.data_path, _config.zhwiki_raw), 'w')
    # WikiCorpus strips the wiki markup and yields plain-text articles.
    wiki = WikiCorpus(os.path.join(_config.data_path, _config.zhwiki_bz2), lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(b' '.join(text).decode('utf-8', 'ignore') + '\n')
        else:
            output.write(' '.join(text) + '\n')
        i += 1
        if i % 10000 == 0:
            print('Saved ' + str(i) + ' articles')
    output.close()
    print('Finished Saved ' + str(i) + ' articles')


config = Config()
dataprocess(config)
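After the extraction finishes, a quick sanity check can be useful. A minimal sketch, reusing the imports and Config defined above:

# Print the start of the first extracted article to confirm the raw corpus looks sane.
with codecs.open(os.path.join(config.data_path, config.zhwiki_raw), 'r', encoding='utf-8') as f:
    print(f.readline()[:200])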
Converting Traditional Characters to Simplified Characters Using OpenCC
Here, you need to install OpenCC beforehand. For how to install OpenCC in a Linux environment, please refer to this article. It takes only two lines of Linux commands to convert traditional characters into simplified ones, and it is very fast.
$ cd /xxx/zhwiki/
$ opencc -i zhwiki_raw.txt -o zhwiki_raw_t2s.txt -c t2s.json
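If you prefer to stay in Python, a Python OpenCC binding (for example the opencc-python-reimplemented package, assumed to be installed) can do the same conversion. This is an alternative sketch, not part of the original command-line step, and it reuses the Config and imports from the data-processing code above:

from opencc import OpenCC  # assumes a Python OpenCC binding is installed

cc = OpenCC('t2s')  # traditional-to-simplified conversion profile
with codecs.open(os.path.join(config.data_path, config.zhwiki_raw), 'r', encoding='utf-8') as fin, \
        codecs.open(os.path.join(config.data_path, config.zhwiki_raw_t2s), 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))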
Text Data Segmentation
For the word segmentation task, we use Python's jieba module directly. You can also use LTP from the Harbin Institute of Technology or the NLTK Python interface to the Stanford segmenter, which are accurate and authoritative, but both take a long time to set up, especially the Stanford one. The jieba segmentation code is as follows:
def is_alpha(tok):
    # Treat a token as "alpha" if it is pure ASCII letters (i.e. not Chinese).
    try:
        return tok.encode('ascii').isalpha()
    except UnicodeEncodeError:
        return False


def zhwiki_segment(_config, remove_alpha=True):
    i = 0
    if six.PY3:
        output = open(os.path.join(_config.data_path, _config.zhwiki_seg_t2s), 'w', encoding='utf-8')
    else:
        output = codecs.open(os.path.join(_config.data_path, _config.zhwiki_seg_t2s), 'w', encoding='utf-8')
    print('Start...')
    with codecs.open(os.path.join(_config.data_path, _config.zhwiki_raw_t2s), 'r', encoding='utf-8') as raw_input:
        for line in raw_input.readlines():
            line = line.strip()
            i += 1
            print('line ' + str(i))
            text = line.split()
            if remove_alpha:
                # Drop pure-ASCII tokens so only Chinese words are kept.
                text = [w for w in text if not is_alpha(w)]
            word_cut_seed = [jieba.cut(t) for t in text]
            tmp = ''
            for sent in word_cut_seed:
                for tok in sent:
                    tmp += tok + ' '
            tmp = tmp.strip()
            if tmp:
                output.write(tmp + '\n')
    output.close()


zhwiki_segment(config)
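As a quick illustration of what jieba.cut produces (the sample sentence is an arbitrary example, not taken from the wiki corpus):

import jieba

# jieba.cut returns a generator of tokens; join them with spaces,
# just as the segmentation function above does.
print(' '.join(jieba.cut(u'我来到北京清华大学')))
# Typical output: 我 来到 北京 清华大学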
Gensim Word2Vec Training
Python's gensim module provides word2vec training, which makes training our model very convenient. For the use of gensim, you can refer to Word2Vec Practice Based on Gensim.
In this training, the size of the word vector is 50, the training window is 5, and the minimum word frequency is 5. Multithreading is used. The specific code is as follows:
def word2vec(_config, saved=False):
    print('Start...')
    model = Word2Vec(LineSentence(os.path.join(_config.data_path, _config.zhwiki_seg_t2s)),
                     size=50, window=5, min_count=5, workers=multiprocessing.cpu_count())
    if saved:
        model.save(os.path.join(_config.data_path, _config.embedded_model_t2s))
        model.save_word2vec_format(os.path.join(_config.data_path, _config.embedded_vector_t2s), binary=False)
    print("Finished!")
    return model


def wordsimilarity(word, model):
    semi = ''
    try:
        semi = model.most_similar(word, topn=10)
    except KeyError:
        print('The word not in vocabulary!')
    for term in semi:
        print('%s,%s' % (term[0], term[1]))


model = word2vec(config, saved=True)
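If you want to reuse the saved model later without retraining, it can be loaded back from disk. A minimal sketch using the same Config paths and the old-style gensim API shown above (the query word is an assumed example; the vocabulary is in Chinese, so 宝马 stands for BMW):

# Reload the trained model from disk and query it.
model = Word2Vec.load(os.path.join(config.data_path, config.embedded_model_t2s))
wordsimilarity(word=u'宝马', model=model)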
The word2vec training is now complete, and we have the model and word vectors we wanted, saved locally. Let's look at the top 10 words closest to BMW and to Tencent, respectively. Most of the words similar to BMW belong to the automotive industry and are mostly car brands; most of the words similar to Tencent belong to the Internet industry.
wordsimilarity(word=u'BMW', model=model)
Porsche, 0.92567974329
Goodyear, 0.888278841972
Rolls-Royce, 0.884045600891
Audi, 0.881808757782
Mazda, 0.881799697876
Yafit, 0.880708634853
Opel, 0.877104878426
Citroen, 0.876984715462
Maserati, 0.868475496769
Santana, 0.865387916565

wordsimilarity(word=u'Tencent', model=model)
Netease, 0.880213916302
Youku, 0.873666107655
Tencent, 0.87026232481
Guangzhou Daily, 0.859486758709
Wechat, 0.835543811321
Tianya Community, 0.834927380085
Robin Li, 0.832848489285
Potato Net, 0.831390202045
Group Purchase, 0.829696238041
Sohu.com, 0.825544642448
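Beyond most_similar, the same model can score the similarity of any two in-vocabulary words. A small sketch (the Chinese tokens are assumed example queries: 宝马 = BMW, 保时捷 = Porsche, 腾讯 = Tencent):

# Pairwise cosine similarity between two vocabulary words.
print(model.similarity(u'宝马', u'保时捷'))  # expected high: both car brands
print(model.similarity(u'宝马', u'腾讯'))    # expected lower: different domains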