# 042 Example 10: Text Word Frequency Statistics

Keywords: lambda, encoding


# 1. Analysis of "Text Word Frequency Statistics"

## 1.1 Problem Analysis

Text word frequency statistics

• Requirement: Which words appear in an article? Which words appear most often?
• How should we do it?

English text --> Chinese text

• English text: word frequency analysis of Hamlet

Students who want the Hamlet text can add me on WeChat: nickchen121

• Chinese text: character appearance analysis in Romance of the Three Kingdoms

Students who want the text of Romance of the Three Kingdoms can add me on WeChat: nickchen121

# 2. "Hamlet English Word Frequency Statistics" Example Explanation

• Text denoising and normalization
• Use dictionaries to express word frequencies
```python
# CalHamletV1.py

def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")           # replace punctuation with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()                    # split on whitespace
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1   # count each word with a dictionary
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True) # sort by frequency, descending
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
```
```
the         948
and         855
to          650
of          581
you         494
a           468
my          447
i           443
in          373
hamlet      361
```
• The results are sorted from most frequent to least frequent
• Observe how many times each word occurs
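
For comparison, the same top-10 list can be produced with `collections.Counter` from the standard library. This is a minimal alternative sketch, not part of the original example; it assumes the same hamlet.txt file is in the working directory.

```python
# Alternative sketch using collections.Counter instead of a hand-built dictionary
from collections import Counter

def get_text(path="hamlet.txt"):
    txt = open(path, "r").read().lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")       # replace punctuation with spaces
    return txt

counts = Counter(get_text().split())     # Counter builds the frequency table
for word, count in counts.most_common(10):
    print("{0:<10}{1:>5}".format(word, count))
```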

# 3. "Character Appearance Statistics in Romance of the Three Kingdoms" Example Explanation (1)

• Chinese word segmentation (see the short jieba sketch after the output below)
• Use dictionaries to express word frequencies
```python
# CalThreeKingdomsV1.py

import jieba
txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)                      # segment the Chinese text into words
counts = {}
for word in words:
    if len(word) == 1:                       # skip single characters
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True) # sort by frequency, descending
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
```
```
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/mh/krrg51957cqgl0rhgnwyylvc0000gn/T/jieba.cache
Loading model cost 1.030 seconds.
Prefix dict has been built succesfully.

Cao Cao 953
Kongming 836
General 772
But say 656
Xuande 585
Guan Gong 510
Prime Minister 491
Two people 469
Must not 440
Jingzhou 425
Xuande said 390
Kongming said 390
Cannot 384
Such 378
Zhang Fei 358
```
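
For readers who have not used jieba before, the short sketch below shows what `jieba.lcut` returns for an arbitrary Chinese sentence; the sample sentence is my own illustration, not taken from the novel.

```python
# Minimal illustration of jieba.lcut: it returns a plain Python list of segments
import jieba

sample = "三国演义是中国古典四大名著之一"
print(jieba.lcut(sample))
# Prints a list of Chinese words; the exact segmentation depends on the
# jieba version and dictionary, e.g. something like ['三国演义', '是', '中国', ...]
```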

# 4. "Character Appearance Statistics in Romance of the Three Kingdoms" Example Explanation (2)

## 4.1 Statistics of Characters in Romance of the Three Kingdoms

Associate word frequencies with characters: a problem-oriented refinement.

Word frequency statistics --> character statistics

```python
# CalThreeKingdomsV2.py
import jieba
txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
excludes = {"General", "But say", "Jingzhou", "Two people", "Must not", "Cannot", "Such"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:                   # skip single characters
        continue
    elif word == "Zhuge Liang" or word == "Kongming said":
        rword = "Kongming"               # merge aliases into one canonical name
    elif word == "Guan Gong" or word == "Yunchang":
        rword = "Guan Yu"
    elif word == "Xuande" or word == "Xuande said":
        rword = "Liu Bei"
    elif word == "Mengde" or word == "Prime Minister":
        rword = "Cao Cao"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]                     # drop high-frequency non-person words
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
```
```
Cao Cao 1451
Kongming 1383
Liu Bei 1252
Guan Yu 784
Zhang Fei 358
Negotiation 344
How to 338
My lord 331
Soldiers 317
Lü Bu 300
```
• Chinese word segmentation

• Use dictionaries to express word frequencies

• Extend the program to solve the actual problem

• Further optimization based on the results (a sketch of this refinement step follows below)

The top 20 characters of Romance of the Three Kingdoms: Cao Cao, Kongming, Liu Bei, Guan Yu, Zhang Fei, Lü Bu, Zhao Yun, Sun Quan, Sima Yi, Zhou Yu, Yuan Shao, Ma Chao, Wei Yan, Huang Zhong, Jiang Wei, Ma Dai, Pang De, Meng Huo, Liu Biao, Xiahou Dun
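
One way to carry out the "further optimization based on the results" step is to keep adding the non-person words that still appear near the top of the output to the exclude set and re-running the script. The sketch below only illustrates the pattern; the helper name drop_non_persons and the small demo dictionary are made up for illustration, and the word lists would grow as real output is inspected.

```python
# Sketch of the refinement loop: given the counts produced by CalThreeKingdomsV2.py,
# drop entries that are not person names and look at the new top entries.
def drop_non_persons(counts, extra_excludes):
    for word in extra_excludes:
        counts.pop(word, None)          # pop() avoids a KeyError if a word is absent
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)

# Illustrative call with a tiny hand-made dictionary and two non-person words
demo_counts = {"Cao Cao": 1451, "Kongming": 1383, "Negotiation": 344, "How to": 338}
for word, count in drop_non_persons(demo_counts, {"Negotiation", "How to"})[:10]:
    print("{0:<10}{1:>5}".format(word, count))
```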

# 5. "Text Word Frequency Statistics" Extended

## 5.1 Extension of application issues

• Dream of the Red Chamber, Journey to the West, Water Margin...
• Government work reports, scientific research papers, news reports...
• Going further? Word clouds will come in a later article (a minimal sketch follows below)...
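
Since word clouds are mentioned as the next step, here is a minimal sketch assuming the third-party wordcloud package (`pip install wordcloud`) and the same threekingdoms.txt file; the font path is a placeholder and must point to a font that supports Chinese.

```python
# Minimal word-cloud sketch (assumes `pip install wordcloud jieba`)
import jieba
from wordcloud import WordCloud

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
words = " ".join(jieba.lcut(txt))          # wordcloud expects space-separated tokens

wc = WordCloud(font_path="msyh.ttc",       # placeholder: any font file with CJK glyphs
               width=800, height=600,
               background_color="white")
wc.generate(words)
wc.to_file("threekingdoms_wordcloud.png")  # writes the image next to the script
```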
