- 1. Analysis of "Text Word Frequency Statistics"
- 2. "Hamlet English Word Frequency Statistics" Example Explanation
- 3. "Character Appearance Statistics in Romance of the Three Kingdoms" Example Explanation (1)
- 4. "Character Appearance Statistics in Romance of the Three Kingdoms" Example Explanation (2)
- 5. "Text Word Frequency Statistics"
1. Analysis of "Text Word Frequency Statistics"
1.1 Problem Analysis
Text word frequency statistics
- Requirements: Which words appear in an article? Which words appear most often?
- How should we approach it?
English text --> Chinese text
- English text: word frequency analysis of Hamlet
Students who want the Hamlet text can message me on WeChat: nickchen121
- Chinese text: character analysis of Romance of the Three Kingdoms
Students who want the Three Kingdoms text can add me on WeChat: nickchen121
2. "Hamlet English Word Frequency Statistics" Example Explanation
- Text denoising and normalization
- Use dictionaries to express word frequencies
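The two steps above can be tried on a short sample string first. This is only a minimal sketch: the sample sentence and the helper names `normalize` and `count_words` are assumptions for illustration, and it uses `string.punctuation`, which also covers the apostrophe.

```python
import string


def normalize(text):
    """Lowercase the text and replace ASCII punctuation with spaces."""
    text = text.lower()
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    return text


def count_words(text):
    """Return a dict mapping each word to its number of occurrences."""
    counts = {}
    for word in normalize(text).split():
        # dict.get returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "To be, or not to be: that is the question."
print(count_words(sample))
```

Lowercasing before counting is what lets "To" and "to" end up under the same key.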
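    # CalHamletV1.py
    def getText():
        # Read the whole file and lowercase it so "The" and "the" count together
        with open("hamlet.txt", "r") as f:
            txt = f.read()
        txt = txt.lower()
        # Replace punctuation with spaces (apostrophes are left untouched)
        for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            txt = txt.replace(ch, " ")
        return txt

    hamletTxt = getText()
    words = hamletTxt.split()
    # Count the occurrences of each word in a dictionary
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # Sort the (word, count) pairs by count, largest first
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))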
Output:
    the         948
    and         855
    to          650
    of          581
    you         494
    a           468
    my          447
    i           443
    in          373
    hamlet      361
- The results are sorted in descending order of count
- Observe how often each word occurs
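The count-and-sort step can also be written with `collections.Counter` from the standard library, whose `most_common(n)` method returns the `n` most frequent items in descending order. This is an alternative sketch rather than the approach used in this section; the function name `top_words` and the sample text are assumptions.

```python
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent whitespace-separated words in text."""
    words = text.lower().split()
    return Counter(words).most_common(n)


sample = "the king and the queen and the prince"
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter` is a dictionary subclass, so it replaces both the `counts.get(word, 0) + 1` loop and the explicit sort.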
3. "Character Appearance Statistics in Romance of the Three Kingdoms" Example Explanation (1)
- Chinese text word segmentation
- Use dictionaries to express word frequencies
    # CalThreeKingdomsV1.py
    import jieba

    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    # Segment the Chinese text into a list of words
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            # Skip single characters and punctuation
            continue
        else:
            counts[word] = counts.get(word, 0) + 1
    # Sort the (word, count) pairs by count, largest first
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))
Output (words shown in English translation):
    Building prefix dict from the default dictionary ...
    Loading model from cache /var/folders/mh/krrg51957cqgl0rhgnwyylvc0000gn/T/jieba.cache
    Loading model cost 1.030 seconds.
    Prefix dict has been built succesfully.
    Cao Cao           953
    Kongming          836
    General           772
    But say           656
    Xuande            585
    Guan Gong         510
    Prime minister    491
    Two people        469
    Must not          440
    Jingzhou          425
    Xuande said       390
    Kongming said     390
    Cannot            384
    Such              378
    Zhang Fei         358
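The filtering rule in the loop is easier to see on a small token list. The sketch below does not call jieba: the list is hand-written to stand in for segmentation output (real results may differ), and the function name `count_tokens` is an assumption.

```python
def count_tokens(tokens, min_len=2):
    """Count tokens, skipping those shorter than min_len characters."""
    counts = {}
    for token in tokens:
        if len(token) < min_len:
            # Single characters are mostly punctuation and particles
            continue
        counts[token] = counts.get(token, 0) + 1
    return counts


# Hypothetical segmentation result, written by hand for illustration
tokens = ["曹操", "曰", "：", "玄德", "与", "关公", "，", "曹操", "引", "兵"]
print(count_tokens(tokens))
```

Dropping single-character tokens removes punctuation cheaply, but frequent two-character words that are not names still get through, as the results above show.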
4. "Character Appearance Statistics in Romance of the Three Kingdoms" Example Explanation (2)
4.1 Character Statistics in Romance of the Three Kingdoms
Map word frequencies to characters, keeping the actual problem in view
Word frequency statistics --> character statistics
    # CalThreeKingdomsV2.py
    import jieba

    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    # Frequent words that are not character names:
    # general, "but say", Jingzhou, two people, must not, cannot, such
    excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
    words = jieba.lcut(txt)
    counts = {}
    # Merge the different names used for the same character while counting
    for word in words:
        if len(word) == 1:
            continue
        elif word == "诸葛亮" or word == "孔明曰":  # Zhuge Liang, "Kongming said"
            rword = "孔明"  # Kongming
        elif word == "关公" or word == "云长":  # Guan Gong, Yunchang
            rword = "关羽"  # Guan Yu
        elif word == "玄德" or word == "玄德曰":  # Xuande, "Xuande said"
            rword = "刘备"  # Liu Bei
        elif word == "孟德" or word == "丞相":  # Mengde, prime minister
            rword = "曹操"  # Cao Cao
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    # Remove the excluded words; pop avoids a KeyError if one never occurred
    for word in excludes:
        counts.pop(word, None)
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))
Output (words shown in English translation):
    Cao Cao          1451
    Kongming         1383
    Liu Bei          1252
    Guan Yu           784
    Zhang Fei         358
    Negotiation       344
    How to            338
    Prince            331
    Sergeant          317
    Lü Bu             300
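As the list of aliases grows, the chain of `elif` branches can be replaced by a lookup table. The sketch below is one possible restructuring, not the program used in this section: the names `ALIASES`, `EXCLUDES` and `count_characters` are assumptions, and the Chinese strings are the assumed Chinese forms of the names and words shown in translation here.

```python
# Alias -> canonical name (assumed Chinese forms of the names in this section)
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",  # Zhuge Liang, "Kongming said" -> Kongming
    "关公": "关羽", "云长": "关羽",      # Guan Gong, Yunchang -> Guan Yu
    "玄德": "刘备", "玄德曰": "刘备",    # Xuande, "Xuande said" -> Liu Bei
    "孟德": "曹操", "丞相": "曹操",      # Mengde, prime minister -> Cao Cao
}

# Frequent words that are not character names
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_characters(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens after merging aliases and dropping excluded words."""
    counts = {}
    for token in tokens:
        if len(token) == 1 or token in excludes:
            continue
        name = aliases.get(token, token)  # fall back to the token itself
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hand-written token list standing in for segmentation output
tokens = ["玄德", "曰", "将军", "玄德曰", "曹操", "丞相", "孟德"]
print(count_characters(tokens))
```

With this layout, each round of optimization only adds an entry to `ALIASES` or `EXCLUDES`; the counting loop stays unchanged.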
- Chinese text word segmentation
- Use dictionaries to express word frequencies
- Extend the program to solve the problem
- Further optimization based on results
The top 20 characters in Romance of the Three Kingdoms by number of appearances: Cao Cao, Kongming, Liu Bei, Guan Yu, Zhang Fei, Lü Bu, Zhao Yun, Sun Quan, Sima Yi, Zhou Yu, Yuan Shao, Ma Chao, Wei Yan, Huang Zhong, Jiang Wei, Ma Dai, Pang De, Meng Huo, Liu Biao, Xiahou Dun
5. "Text Word Frequency Statistics"
5.1 Extending to other applications
- Dream of the Red Chamber, Journey to the West, Water Margin...
- Government work reports, scientific research papers, news reports...
- Going further? Word clouds will come later...
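To reuse the same counting on other texts, the steps from both examples can be folded into one function. This is a hedged sketch: the name `word_frequencies` and its parameters are assumptions, and the tokenizer is passed in so that `str.split` can serve English text while a segmenter such as `jieba.lcut` can serve Chinese text.

```python
def word_frequencies(text, tokenize=str.split, min_len=1, excludes=frozenset()):
    """Return (word, count) pairs sorted by count, largest first.

    tokenize: function that turns the text into a list of tokens
    min_len:  tokens shorter than this are skipped
    excludes: tokens to leave out of the result
    """
    counts = {}
    for token in tokenize(text):
        if len(token) < min_len or token in excludes:
            continue
        counts[token] = counts.get(token, 0) + 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# English-style usage on a short sample string
report = "growth and reform and growth and stability"
for word, count in word_frequencies(report, excludes={"and"})[:3]:
    print("{0:<10}{1:>5}".format(word, count))
```

For a Chinese novel, the call would pass `tokenize=jieba.lcut` and `min_len=2`, together with an exclusion set built up by inspecting the results.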