Natural language processing: using the jieba word segmenter

Keywords: Deep Learning, NLP

1. jieba Chinese word segmentation

import jieba

text = "In most cases, vocabulary is the basis of our understanding of sentences and articles, so we need a tool to decompose the complete text into finer grained words."

cut_result = jieba.cut(text, cut_all=True)  # Full mode
print(cut_result)
print("\n Full mode : " + "/ ".join(cut_result))

cut_result = jieba.cut(text, cut_all=False)  # Precise mode
print("\n Precise mode : " + "/ ".join(cut_result))

# Search engine mode
seg_result = jieba.cut_for_search(text)
print("\n Search engine mode : " + "/ ".join(seg_result))
<generator object Tokenizer.cut at 0x000001527DC81B48>

Full mode : Most/ gross/ part/ situation/ lower/ ,/ vocabulary/ yes/ We/ yes/ sentence/ Hewen/ article/ understand/ of/ Basics/ ,/ therefore/ need/ One/ tool/ go/ hold/ complete/ of/ text/ Middle score/ decompose/ Decompose into/ granularity/ more/ fine/ of/ Words/ . 

Precise mode : gross/ situation/ lower/ ,/ vocabulary/ yes/ We/ yes/ sentence/ and/ article/ understand/ of/ Basics/ ,/ therefore/ need/ One/ tool/ go/ hold/ complete/ of/ text/ in/ Decompose into/ granularity/ Finer/ of/ Words/ . 

Search engine mode : Most/ part/ gross/ situation/ lower/ ,/ vocabulary/ yes/ We/ yes/ sentence/ and/ article/ understand/ of/ Basics/ ,/ therefore/ need/ One/ tool/ go/ hold/ complete/ of/ text/ in/ decompose/ Decompose into/ granularity/ Finer/ of/ Words/ . 
  • jieba.lcut and jieba.lcut_for_search return a list directly
    1. They behave the same as cut and cut_for_search; only the return type differs (a list instead of a generator)
str = "hello everyone, I like natural language processing very much"
lcut_result = jieba.lcut(str, cut_all=True)  # Full mode
print(" Full mode : ", lcut_result)

seg_lcut_result = jieba.lcut_for_search(str)
print(" search mode : ", seg_lcut_result)
 Full mode :  ['everybody', 'good', ',', ' ', '', 'I', 'very', 'like', 'natural', 'natural language', 'language', 'handle']
 search mode :  ['everybody', 'good', ',', ' ', 'I', 'very', 'like', 'natural', 'language', 'natural language', 'handle']

1.2 user-defined dictionary

  • Purpose: in many cases we need to segment text for our own domain, which contains proprietary terms that the default dictionary may not cover.
  • Operations (a short sketch follows this list):
    1. jieba.load_userdict(file_name) loads a user dictionary file
    2. A small number of words can also be added manually with the following methods:
    3. add_word(word, freq=None, tag=None) and del_word(word) modify the dictionary dynamically at runtime
    4. suggest_freq(segment, tune=True) adjusts the word frequency of a single word so that it can (or cannot) be split apart
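A minimal sketch of these calls (the dictionary file name and the example entries are placeholders, not part of jieba):

import jieba

# A user dictionary file has one entry per line: "word [freq] [part-of-speech]".
# Write a tiny one and load it; the file name and the entry are only illustrative.
with open('my_dict.txt', 'w', encoding='utf-8') as f:
    f.write('word2vec 5 eng\n')
jieba.load_userdict('my_dict.txt')

# Add or remove individual words dynamically at runtime.
jieba.add_word('doc2vec', freq=5, tag='eng')
jieba.del_word('doc2vec')

The demo below uses suggest_freq to adjust the word frequency of a single word: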
print("Before adjusting word frequency : ", '/ '.join(jieba.cut("If you put it in the old dictionary, there will be an error", HMM=False)))

# Raise the frequency of 'in' and 'take' so that the two words are segmented separately
print(jieba.suggest_freq(segment=('in', 'take'), tune=True))

print("After adjusting word frequency : ", '/ '.join(jieba.cut("If you put it in the old dictionary, there will be an error", HMM=False)))
Before adjusting word frequency :  If/ put to/ used/ Dictionaries/ Lieutenant general/ error
494
 After adjusting word frequency :  If/ put to/ used/ Dictionaries/ in/ take/ error

1.3 keyword extraction

  • Keyword extraction based on TF-IDF algorithm

import jieba.analyse
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

  • sentence: the text to extract keywords from
  • topK: return the topK keywords with the largest TF-IDF weights; the default is 20
  • withWeight: whether to return the weight value together with each keyword; the default is False
  • allowPOS: only include words with the specified parts of speech; the default is empty, i.e. no filtering
  • The return value is a list of the extracted keywords, e.g. ['Wei Shao', 'Durant', ...]
import jieba.analyse as analyse

sentence = open('./data/NBA.txt', encoding='utf-8').read()  # Read text

extract_res = analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())  # The return value is List
print(" ".join(extract_res))
Wei Shao Durant All Star Game MVP Willis is shooting for Cole, warrior player Brooke's locker NBA Zhang Weiping of sanlianzhuang guides the thunder star team in the West
  • Using TF-IDF to extract keywords from Journey to the West
# errors='ignore' skips the parts of the file that cannot be decoded
words = open(u'./data/Journey to the West.txt', errors='ignore').read()

extract_word = analyse.extract_tags(words, topK=10, withWeight=True, allowPOS=())
extract_word
[('Walker', 0.149712969050074),
 ('Bajie', 0.0684507752483343),
 ('master', 0.06131446245667119),
 ('Threefold Canon', 0.05297033399354041),
 ('Tang Monk', 0.034778995818127),
 ('Great sage', 0.0324254151715385),
 ('Monk Sha', 0.03158386691903323),
 ('Goblin', 0.02770001861295469),
 ('bodhisattva', 0.02576378770669382),
 ('buddhist monk', 0.024268051645726228)]

1.4 supplementary notes on TF-IDF keyword extraction

  • Introduction to the TF-IDF algorithm (a small numeric sketch follows this section):

    1. TF (term frequency): how often a word appears in a document. Calculation: number of times the word appears in the document / total number of words in the document
    2. IDF (inverse document frequency): log(number of documents in the corpus / (number of documents containing the word + 1))
    3. TF-IDF = term frequency (TF) * inverse document frequency (IDF)
  • The inverse document frequency (IDF) corpus used in keyword extraction can be switched to the path of a custom corpus

  • Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path of the custom corpus

    1. See here for an example of a custom corpus
    2. See here for a usage example
  • The stop-words corpus used in keyword extraction can also be switched to the path of a custom corpus

    1. Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path of the custom corpus
    2. See here for an example of a custom corpus
    3. See here for a usage example
  • Keyword weight values can be returned together with the keywords

  • See here for a usage example
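As a rough illustration of the formulas above, the sketch below computes TF-IDF by hand on a tiny made-up corpus and shows where the custom-corpus switches would go; the corpus, the words and the file paths are invented for illustration only.

import math
import jieba.analyse as analyse

# Toy corpus: three "documents", already segmented into words (invented example).
docs = [['warrior', 'player', 'shoot'],
        ['warrior', 'coach', 'train'],
        ['player', 'train', 'shoot', 'shoot']]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)                # term frequency within this document
    n_contain = sum(1 for d in docs if word in d)  # documents containing the word
    idf = math.log(len(docs) / (n_contain + 1))    # inverse document frequency
    return tf * idf

print(tf_idf('coach', docs[1], docs))   # rarer word: 1/3 * log(3/2) ~ 0.135
print(tf_idf('shoot', docs[2], docs))   # appears in 2 of 3 documents: 2/4 * log(3/3) = 0

# Switching to custom corpora, as described above (the paths are placeholders):
# analyse.set_idf_path('./data/my_idf.txt')
# analyse.set_stop_words('./data/my_stop_words.txt')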

1.5 keyword extraction based on TextRank algorithm

  • Direct usage: jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')). The interface is the same as extract_tags; note that it filters by part of speech by default.
  • jieba.analyse.TextRank() creates a new custom TextRank instance
  • Algorithm paper: TextRank: Bringing Order into Texts
  • Basic idea (a toy sketch follows this list):
    1. Segment the text whose keywords are to be extracted
    2. Build a graph using a fixed window size (5 by default, adjustable via the span attribute) and the co-occurrence relationships between words
    3. Compute the PageRank of the nodes in the graph; note that it is an undirected weighted graph
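Before the jieba calls, here is a toy sketch of the idea itself (not jieba's implementation): build an undirected weighted co-occurrence graph over a fixed window and run a few PageRank iterations. The token list, window size and iteration count are arbitrary illustrations.

from collections import defaultdict

words = ['warrior', 'player', 'shoot', 'warrior', 'coach', 'player', 'shoot']
span = 5   # co-occurrence window size (jieba's default is also 5)

# 1. Build an undirected weighted co-occurrence graph.
graph = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(words):
    for other in words[i + 1:i + span]:
        if other != w:
            graph[w][other] += 1
            graph[other][w] += 1

# 2. A few iterations of weighted PageRank over the undirected graph.
d = 0.85
rank = {w: 1.0 for w in graph}
for _ in range(20):
    new_rank = {}
    for w in graph:
        s = sum(rank[v] * graph[v][w] / sum(graph[v].values()) for v in graph[w])
        new_rank[w] = (1 - d) + d * s
    rank = new_rank

print(sorted(rank.items(), key=lambda x: -x[1]))   # words ranked as keyword candidates

The demos below use jieba's own TextRank implementation on the NBA text and Journey to the West: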
import jieba.analyse as analyse

lines = open('./data/NBA.txt', encoding='utf-8').read()

# Keyword extraction using TextRank algorithm
# allowPOS only includes words with the specified part of speech. The default value is empty, that is, it is not filtered
word_tr = analyse.textrank(sentence=lines, topK=10, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
print(' '.join(word_tr))

res = analyse.textrank(sentence=lines, topK=10, withWeight=False, allowPOS=('ns', 'n'))
print(' '.join(res))
The warriors of the All-Star game are guiding each other's shooting. There is no time for players to appear
 Warriors are playing All-Star game to guide shooting time, the other party's live results
words_my = open('./data/Journey to the West.txt', errors='ignore').read()

print(analyse.textrank(words_my, topK=10, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')))
print(analyse.textrank(words_my, topK=10, withWeight=True, allowPOS=('ns', 'n')))
['Walker', 'master', 'Bajie', 'Threefold Canon', 'Great sage', 'ignorance', 'bodhisattva', 'Goblin', 'See only', 'elders']
[('Walker', 1.0), ('master', 0.4068394703674021), ('Bajie', 0.3983011869139073), ('Threefold Canon', 0.3907378862237123), ('Great sage', 0.24021798730344252), ('bodhisattva', 0.20177693035598557), ('Goblin', 0.18936895377629598), ('king', 0.15925294307325125), ('elders', 0.15196050918328696), ('apprentice', 0.10709412067136634)]

1.6 part of speech tagging

  • jieba.posseg.POSTokenizer(tokenizer=None)
    1. Creates a new custom part-of-speech tokenizer. The tokenizer parameter specifies the jieba.Tokenizer used internally; jieba.posseg.dt is the default POS tagging tokenizer.
    2. Tags the part of speech of each word after segmentation, using a tag set compatible with ICTCLAS.
    3. For the full part-of-speech table, see: Chinese part of speech tagging set
  • Introduction to common parts of speech
    1. a: adjective            j: abbreviation
    2. b: distinguishing word  k: subsequent component
    3. c: conjunction          m: numeral
    4. d: adverb               n: common noun
    5. e: exclamation          nh: name
    6. g: morpheme             ni: organization name
    7. h: preceding component  nl: locative noun
    8. i: idiom                ns: place name
    9. nt: time word           nz: other proper noun
    10. o: onomatopoeia        p: preposition
    11. q: quantifier          r: pronoun
    12. u: auxiliary           v: verb
    13. wp: punctuation        ws: string
    14. x: non-morpheme word
import jieba.posseg as posseg

cut_result = posseg.cut('I like natural language processing very much')  # The return type is: generator

for word, flag in cut_result:
    print(" %s , %s" % (word, flag))
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\WANGTI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.628 seconds.
Prefix dict has been built successfully.


 I , r
 very , d
 like , v
 natural language , l
 handle , v
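As mentioned in the list above, jieba.posseg.POSTokenizer can wrap a separate jieba.Tokenizer instance; here is a minimal sketch of that, using the default dictionary (any custom dictionary path would be a placeholder here):

import jieba
import jieba.posseg as posseg

# Build a standalone tokenizer (a custom dictionary path could be passed here)
# and wrap it in a POS tokenizer; POSTokenizer(tokenizer=None) falls back to the default.
my_tokenizer = jieba.Tokenizer()
my_pos_tokenizer = posseg.POSTokenizer(tokenizer=my_tokenizer)

for word, flag in my_pos_tokenizer.cut('I like natural language processing very much'):
    print(word, flag)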

1.7 parallel word segmentation

  • Principle: split the target text by line, assign the lines to multiple Python processes for parallel segmentation, then merge the results, which gives a considerable speed-up. It is based on Python's own multiprocessing module and does not currently support Windows

  • Usage (a minimal sketch follows this list):

    1. jieba.enable_parallel(4)  # enable parallel segmentation mode; the argument is the number of parallel processes
    2. jieba.disable_parallel()  # disable parallel segmentation mode
  • Note: parallel segmentation only supports the default tokenizers jieba.dt and jieba.posseg.dt.
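A minimal sketch, assuming a POSIX system (as noted, Windows is not supported) and reusing the NBA text from earlier:

import jieba

jieba.enable_parallel(4)   # enable parallel mode with 4 worker processes (POSIX only)

content = open('./data/NBA.txt', encoding='utf-8').read()
words = jieba.lcut(content)          # lines are segmented by multiple processes, results merged
print(len(words))

jieba.disable_parallel()   # back to single-process segmentation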

1.8 Tokenize: return the start and end positions of each word in the original text

  • Note: the input parameter only accepts unicode
  • Usage: prefix the string with u, e.g. u'I love natural language processing'
  • The return value is a generator; each item is a tuple where item[0] is the word, item[1] the start position and item[2] the end position
print("Default mode tokenize")

result_genera = jieba.tokenize(u'Natural language processing is used in many fields') # The return value type is an iterator
for tk in result_genera:
    print("%s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

print("=" * 40)

result_genera_search = jieba.tokenize(u'Natural language processing is used in many fields', mode='search') # The return value type is an iterator
for tk in result_genera_search:
    print("%s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))
Default mode tokenize
 natural language		 start: 0 		 end: 4
 handle		 start: 4 		 end: 6
 stay		 start: 6 		 end: 7
 quite a lot		 start: 7 		 end: 9
 field		 start: 9 		 end: 11
 all		 start: 11 		 end: 12
 yes		 start: 12 		 end: 13
 application		 start: 13 		 end: 15
========================================
natural		 start: 0 		 end: 2
 language		 start: 2 		 end: 4
 natural language		 start: 0 		 end: 4
 handle		 start: 4 		 end: 6
 stay		 start: 6 		 end: 7
 quite a lot		 start: 7 		 end: 9
 field		 start: 9 		 end: 11
 all		 start: 11 		 end: 12
 yes		 start: 12 		 end: 13
 application		 start: 13 		 end: 15

1.9 command line word segmentation

  • Usage example: python -m jieba news.txt > cut_result.txt

  • Command line options (translated):

    Usage: python -m jieba [options] filename

  • Jieba command line interface.

  • Positional argument:
    filename              the input file
  • Optional arguments:
    -h, --help            show this help message and exit
    -d [DELIM], --delimiter [DELIM]
                          use DELIM to separate words instead of the default '/';
                          if DELIM is not specified, a space is used
    -p [DELIM], --pos [DELIM]
                          enable part-of-speech tagging; if DELIM is specified it separates
                          each word and its POS tag, otherwise '_' is used
    -D DICT, --dict DICT  use DICT instead of the default dictionary
    -u USER_DICT, --user-dict USER_DICT
                          use USER_DICT as an additional dictionary together with the
                          default dictionary or a custom dictionary
    -a, --cut-all         full mode segmentation (POS tagging is not supported)
    -n, --no-hmm          do not use the hidden Markov model
    -q, --quiet           do not print loading messages to STDERR
    -V, --version         show version information and exit
  • If no file name is specified, standard input is used.
