1. jieba Chinese word segmentation
import jieba

text = "In most cases, vocabulary is the basis of our understanding of sentences and articles, so we need a tool to decompose the complete text into finer grained words."

cut_result = jieba.cut(text, cut_all=True)    # Full mode
print(cut_result)
print("\n Full mode : " + "/ ".join(cut_result))

cut_result = jieba.cut(text, cut_all=False)   # Precise mode
print("\n Precise mode : " + "/ ".join(cut_result))

# Search engine mode
seg_result = jieba.cut_for_search(text)
print("\n Search engine mode : " + "/ ".join(seg_result))
<generator object Tokenizer.cut at 0x000001527DC81B48>

Full mode : Most/ gross/ part/ situation/ lower/ ,/ vocabulary/ yes/ We/ yes/ sentence/ Hewen/ article/ understand/ of/ Basics/ ,/ therefore/ need/ One/ tool/ go/ hold/ complete/ of/ text/ Middle score/ decompose/ Decompose into/ granularity/ more/ fine/ of/ Words/ .

Precise mode : gross/ situation/ lower/ ,/ vocabulary/ yes/ We/ yes/ sentence/ and/ article/ understand/ of/ Basics/ ,/ therefore/ need/ One/ tool/ go/ hold/ complete/ of/ text/ in/ Decompose into/ granularity/ Finer/ of/ Words/ .

Search engine mode : Most/ part/ gross/ situation/ lower/ ,/ vocabulary/ yes/ We/ yes/ sentence/ and/ article/ understand/ of/ Basics/ ,/ therefore/ need/ One/ tool/ go/ hold/ complete/ of/ text/ in/ decompose/ Decompose into/ granularity/ Finer/ of/ Words/ .
- jieba.lcut and jieba.lcut_for_search return a list directly
- They behave exactly like cut and cut_for_search; only the return type differs (a list instead of a generator)
str = "hello everyone, I like natural language processing very much" lcut_result = jieba.lcut(str, cut_all=True) # Full mode print(" Full mode : ", lcut_result) seg_lcut_result = jieba.lcut_for_search(str) print(" search mode : ", seg_lcut_result)
Full mode :  ['everybody', 'good', ',', ' ', '', 'I', 'very', 'like', 'natural', 'natural language', 'language', 'handle']
search mode :  ['everybody', 'good', ',', ' ', 'I', 'very', 'like', 'natural', 'language', 'natural language', 'handle']
1.2 user-defined dictionary
- Purpose: we often need to segment text for our own domain, which contains proprietary terms that the default dictionary does not cover.
- Usage:
  - Use jieba.load_userdict(file_name) to load a user dictionary from a file
  - A small number of words can also be added manually with the following methods (a short sketch follows this list):
    - Use jieba.add_word(word, freq=None, tag=None) and jieba.del_word(word) to modify the dictionary dynamically at runtime
    - Use jieba.suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be split out
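The demo below only exercises suggest_freq, so here is a complementary minimal sketch of add_word / del_word and load_userdict. The word 'word2vec' and the path 'userdict.txt' are hypothetical placeholders chosen only for illustration, not taken from the original text.

import jieba

# Hypothetical domain term: register it with an optional frequency and tag, then remove it again.
jieba.add_word('word2vec', freq=100, tag='nz')
jieba.del_word('word2vec')

# Load a whole user dictionary: one entry per line in the form "word [freq] [tag]".
# 'userdict.txt' is a placeholder path used only for illustration.
# jieba.load_userdict('userdict.txt')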
print("Before adjusting word frequency : ", '/ '.join(jieba.cut("If you put it in the old dictionary, there will be an error", HMM=False))) # Adjust the frequency of "Zhong" and "Jiang" to ensure that the two words can be separated print(jieba.suggest_freq(segment=('in', 'take'), tune=True)) print("After adjusting word frequency : ", '/ '.join(jieba.cut("If you put it in the old dictionary, there will be an error", HMM=False)))
Before adjusting word frequency :  If/ put to/ used/ Dictionaries/ Lieutenant general/ error
494
After adjusting word frequency :  If/ put to/ used/ Dictionaries/ in/ take/ error
1.3 keyword extraction
- Keyword extraction based on TF-IDF algorithm
import jieba.analyse
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text from which keywords are extracted
- topK: the number of keywords with the largest TF-IDF weights to return; the default is 20
- withWeight: whether to return the keyword weights together with the keywords; the default is False
- allowPOS: keep only words with the specified parts of speech; the default is an empty tuple, i.e. no filtering
- The return value is a list of extracted keywords, e.g. ['Wei Shao', 'Durant', ...]; a brief example of these parameters follows
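To make the parameters concrete before the file-based examples below, here is a small sketch on a made-up one-line text; the sentence is not from the original notebook, and the absolute weights depend on jieba's built-in IDF table.

import jieba.analyse as analyse

demo_text = "the warriors player is shooting while the thunder player is defending"

# topK limits how many keywords come back, withWeight=True also returns the TF-IDF
# weight, and an empty allowPOS tuple means no part-of-speech filtering.
for word, weight in analyse.extract_tags(demo_text, topK=5, withWeight=True, allowPOS=()):
    print(word, weight)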
import jieba.analyse as analyse

sentence = open('./data/NBA.txt', encoding='utf-8').read()   # Read the text

# The return value is a list
extract_res = analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
print(" ".join(extract_res))
Wei Shao Durant All Star Game MVP Willis is shooting for Cole, warrior player Brooke's locker NBA Zhang Weiping of sanlianzhuang guides the thunder star team in the West
- Using TF-IDF to extract keywords from Journey to the West
# errors='ignore' skips the parts of the file that cannot be decoded
words = open(u'./data/Journey to the West.txt', errors='ignore').read()

extract_word = analyse.extract_tags(words, topK=10, withWeight=True, allowPOS=())
extract_word
[('Walker', 0.149712969050074), ('Bajie', 0.0684507752483343), ('master', 0.06131446245667119), ('Threefold Canon', 0.05297033399354041), ('Tang Monk', 0.034778995818127), ('Great sage', 0.0324254151715385), ('Monk Sha', 0.03158386691903323), ('Goblin', 0.02770001861295469), ('bodhisattva', 0.02576378770669382), ('buddhist monk', 0.024268051645726228)]
1.4 supplementary notes on TF-IDF keyword extraction
- Introduction to the TF-IDF algorithm (a toy worked example follows this list):
  - TF (term frequency): how often a word occurs in a document, computed as the number of times the word appears in the document divided by the total number of words in the document
  - IDF (inverse document frequency): log(number of documents in the corpus / (number of documents containing the word + 1))
  - TF-IDF = term frequency (TF) * inverse document frequency (IDF)
- The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus
  - Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path of the custom corpus
- The stop-words corpus used for keyword extraction can also be switched to a custom corpus path
- For an example of returning the keyword weight values together with the keywords, see the Journey to the West example above (withWeight=True)
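As a sanity check on the formulas above, here is a tiny self-contained sketch that computes TF-IDF by hand on an invented three-document toy corpus; jieba itself uses a precomputed IDF table, so the numbers will not match extract_tags.

import math
from collections import Counter

# Toy corpus invented for illustration; each "document" is a list of already-segmented words.
docs = [
    ['warriors', 'win', 'the', 'game'],
    ['the', 'thunder', 'lose', 'the', 'game'],
    ['warriors', 'star', 'scores'],
]

def tf_idf(doc, corpus):
    counts = Counter(doc)
    scores = {}
    for word, count in counts.items():
        tf = count / len(doc)                         # term frequency within this document
        df = sum(1 for d in corpus if word in d)      # number of documents containing the word
        idf = math.log(len(corpus) / (df + 1))        # the formula above: log(N / (df + 1))
        scores[word] = tf * idf
    return scores

print(sorted(tf_idf(docs[0], docs).items(), key=lambda kv: -kv[1]))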
1.5 keyword extraction based on TextRank algorithm
- jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) can be used directly; the interface is the same as extract_tags. Note that it filters by part of speech by default.
- jieba.analyse.TextRank() creates a new custom TextRank instance
- Algorithm paper: TextRank: Bringing Order into Texts
- Basic idea:
- Segment the text from which keywords are to be extracted
- Build a graph from the co-occurrence relationships between words within a fixed window (5 by default, adjustable via the span attribute)
- Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted (a rough sketch of these steps follows this list)
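To make the steps concrete, here is a rough self-contained sketch of the idea, not jieba's actual implementation: an undirected weighted co-occurrence graph over a window of 5 tokens, followed by a few plain PageRank iterations. The token list is invented and no part-of-speech filtering is applied.

from collections import defaultdict

# Already-segmented (and, in jieba, part-of-speech-filtered) tokens; invented for illustration.
tokens = ['warriors', 'player', 'game', 'player', 'shooting',
          'game', 'warriors', 'coach', 'shooting', 'player']

# Step 2: build an undirected weighted co-occurrence graph over a fixed window (default span = 5).
window = 5
edge_weight = defaultdict(float)
neighbors = defaultdict(set)
for i, a in enumerate(tokens):
    for j in range(i + 1, min(i + window, len(tokens))):
        b = tokens[j]
        if a == b:
            continue
        edge_weight[(a, b)] += 1.0
        edge_weight[(b, a)] += 1.0
        neighbors[a].add(b)
        neighbors[b].add(a)

# Step 3: run plain PageRank on the undirected weighted graph.
damping = 0.85
rank = {w: 1.0 for w in neighbors}
out_weight = {w: sum(edge_weight[(w, v)] for v in neighbors[w]) for w in neighbors}
for _ in range(30):   # a fixed number of iterations is enough for a sketch
    rank = {
        w: (1 - damping) + damping * sum(
            rank[v] * edge_weight[(v, w)] / out_weight[v] for v in neighbors[w]
        )
        for w in neighbors
    }

print(sorted(rank.items(), key=lambda kv: -kv[1]))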
import jieba.analyse as analyse

lines = open('./data/NBA.txt', encoding='utf-8').read()

# Keyword extraction using the TextRank algorithm;
# allowPOS keeps only words with the specified parts of speech
word_tr = analyse.textrank(sentence=lines, topK=10, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
print(' '.join(word_tr))

res = analyse.textrank(sentence=lines, topK=10, withWeight=False, allowPOS=('ns', 'n'))
print(' '.join(res))
The warriors of the All-Star game are guiding each other's shooting. There is no time for players to appear
Warriors are playing All-Star game to guide shooting time, the other party's live results
words_my = open('./data/Journey to the West.txt', errors='ignore').read()

print(analyse.textrank(words_my, topK=10, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')))
print(analyse.textrank(words_my, topK=10, withWeight=True, allowPOS=('ns', 'n')))
['Walker', 'master', 'Bajie', 'Threefold Canon', 'Great sage', 'ignorance', 'bodhisattva', 'Goblin', 'See only', 'elders']
[('Walker', 1.0), ('master', 0.4068394703674021), ('Bajie', 0.3983011869139073), ('Threefold Canon', 0.3907378862237123), ('Great sage', 0.24021798730344252), ('bodhisattva', 0.20177693035598557), ('Goblin', 0.18936895377629598), ('king', 0.15925294307325125), ('elders', 0.15196050918328696), ('apprentice', 0.10709412067136634)]
1.6 part of speech tagging
- jieba.posseg.POSTokenizer(tokenizer=None)
  - Creates a new custom part-of-speech tokenizer; the tokenizer parameter specifies the jieba.Tokenizer used internally. jieba.posseg.dt is the default part-of-speech tagging tokenizer (a minimal sketch of a custom POSTokenizer follows the example below).
  - Tags the part of speech of each word after segmentation, using a tag set compatible with ICTCLAS.
- For the full part-of-speech table, see: Chinese part-of-speech tagging set
- Introduction to common parts of speech
- a: adjective
- b: distinguishing word
- c: conjunction
- d: adverb
- e: exclamation
- g: morpheme
- h: preceding component
- i: idiom
- j: abbreviation
- k: subsequent component
- m: numeral
- n: common noun
- nh: person name
- ni: organization name
- nl: locative noun
- ns: place name
- nt: time word
- nz: other proper noun
- o: onomatopoeia
- p: preposition
- q: quantifier
- r: pronoun
- u: auxiliary
- v: verb
- wp: punctuation
- ws: string
- x: non-morpheme word
import jieba.posseg as posseg

cut_result = posseg.cut('I like natural language processing very much')   # The return type is a generator

for word, flag in cut_result:
    print(" %s , %s" % (word, flag))
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\WANGTI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.628 seconds.
Prefix dict has been built successfully.
 I , r
 very , d
 like , v
 natural language , l
 handle , v
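As a complement to the default posseg.cut above, here is a minimal sketch of a custom POSTokenizer wired to its own jieba.Tokenizer instance; the separate instance still loads the default dictionary here, the sketch only illustrates the wiring, and the sentence is the one from the example above.

import jieba
import jieba.posseg as posseg

# A separate Tokenizer instance with its own dictionary; it could load a different
# dictionary file without touching the global jieba.dt tokenizer.
my_tokenizer = jieba.Tokenizer()
my_pos_tokenizer = posseg.POSTokenizer(tokenizer=my_tokenizer)

for word, flag in my_pos_tokenizer.cut('I like natural language processing very much'):
    print(word, flag)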
1.7 parallel word segmentation
- Principle: split the target text by line, assign the lines to multiple Python processes for parallel segmentation, and then merge the results, which gives a considerable speed-up. It is based on Python's own multiprocessing module and currently does not support Windows.
- Usage (a short sketch follows):
  - jieba.enable_parallel(4)   # enable parallel segmentation; the argument is the number of parallel processes
  - jieba.disable_parallel()   # disable parallel segmentation
- Note: parallel segmentation only supports the default tokenizers jieba.dt and jieba.posseg.dt.
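A minimal sketch of the usage above, assuming a non-Windows system and reusing the Journey to the West text file from the earlier examples; it simply compares parallel and serial throughput.

import time
import jieba

content = open('./data/Journey to the West.txt', errors='ignore').read()

jieba.enable_parallel(4)              # 4 worker processes (POSIX systems only)
t1 = time.time()
list(jieba.cut(content))
print('parallel: %.2f seconds' % (time.time() - t1))

jieba.disable_parallel()              # back to single-process segmentation
t1 = time.time()
list(jieba.cut(content))
print('serial:   %.2f seconds' % (time.time() - t1))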
1.8 Tokenize: returns the start and end position of the word in the original text
- Note that the input only accepts unicode
- To use it, add a u prefix to the string, e.g. u'I love natural language processing'
- The return value is a generator; iterating it yields tuples where item[0] is the word, item[1] the start position, and item[2] the end position
print("Default mode tokenize") result_genera = jieba.tokenize(u'Natural language processing is used in many fields') # The return value type is an iterator for tk in result_genera: print("%s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2])) print("=" * 40) result_genera_search = jieba.tokenize(u'Natural language processing is used in many fields', mode='search') # The return value type is an iterator for tk in result_genera_search: print("%s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))
Default mode tokenize
natural language     start: 0     end: 4
handle               start: 4     end: 6
stay                 start: 6     end: 7
quite a lot          start: 7     end: 9
field                start: 9     end: 11
all                  start: 11    end: 12
yes                  start: 12    end: 13
application          start: 13    end: 15
========================================
natural              start: 0     end: 2
language             start: 2     end: 4
natural language     start: 0     end: 4
handle               start: 4     end: 6
stay                 start: 6     end: 7
quite a lot          start: 7     end: 9
field                start: 9     end: 11
all                  start: 11    end: 12
yes                  start: 12    end: 13
application          start: 13    end: 15
1.9 command line word segmentation
- Usage example: python -m jieba news.txt > cut_result.txt
- Command line options (translated):
  - Usage: python -m jieba [options] filename
  - jieba command line interface.
  - Fixed arguments:
    - filename              input file
  - Optional arguments:
    - -h, --help            show this help message and exit
    - -d [DELIM], --delimiter [DELIM]
                            use DELIM to separate words instead of the default ' / ';
                            if DELIM is not specified, a single space is used
    - -p [DELIM], --pos [DELIM]
                            enable part-of-speech tagging; if DELIM is specified, it separates
                            a word from its part of speech, otherwise _ is used
    - -D DICT, --dict DICT  use DICT instead of the default dictionary
    - -u USER_DICT, --user-dict USER_DICT
                            use USER_DICT as an additional dictionary together with the default or a custom dictionary
    - -a, --cut-all         full-mode segmentation (part-of-speech tagging is not supported)
    - -n, --no-hmm          do not use the Hidden Markov Model
    - -q, --quiet           do not print loading messages to STDERR
    - -V, --version         show version information and exit
  - If no file name is specified, standard input is used.