042 Example 10: Text Word Frequency Statistics

Keywords: Lambda encoding

1. Analysis of "Text Word Frequency Statistics"

1.1 Problem Analysis

Text word frequency statistics

  • Requirements: Which words appear in an article? Which words appear most often?
  • How should we approach it?
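
The two questions map onto a dictionary of counts followed by a sort. A minimal sketch on a made-up sentence (the sample string and variable names are illustrative and not part of the course files):

```python
# Hypothetical sample sentence; any whitespace-separated text works.
sample = "to be or not to be that is the question"

counts = {}
for word in sample.split():
    # dict.get returns 0 the first time a word is seen
    counts[word] = counts.get(word, 0) + 1

# Which words appear? The dictionary keys.
distinct = sorted(counts)

# Which appear most? Sort the (word, count) pairs by count, largest first.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

print(distinct)
print(ranked[:2])
```

The same two steps, counting into a dictionary and sorting by the count, carry through both examples that follow.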

English text --> Chinese text

  • English text: word frequency analysis of Hamlet

Students who want the Hamlet text can contact me on WeChat: nickchen121

  • Chinese text: character appearance analysis of Romance of the Three Kingdoms

Students who want the Romance of the Three Kingdoms text can add me on WeChat: nickchen121

2. "Hamlet English Word Frequency Statistics" Example Explanation

  • Text denoising and normalization
  • Use dictionaries to express word frequencies
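# CalHamletV1.py


def getText():
    # Read the whole play; "with" closes the file when done
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    # Normalize case so "The" and "the" count as the same word
    txt = txt.lower()
    # Denoise: replace each punctuation character with a space
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")
    return txt


hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    # The dictionary maps each word to its number of occurrences
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
# Sort the (word, count) pairs by count, largest first
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))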
the         948
and         855
to          650
of          581
you         494
a           468
my          447
i           443
in          373
hamlet      361
  • Run results sorted from large to small
  • Observe the number of occurrences of words
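
The denoising loop in getText() can also be written with str.translate, which substitutes every character in a single pass. A sketch, assuming the intended character set is ASCII punctuation without the apostrophe (the helper name normalize is made up for this illustration):

```python
import string

# Assumed character set: ASCII punctuation except the apostrophe,
# so contractions such as "don't" stay in one piece.
PUNCT = string.punctuation.replace("'", "")
# Translation table mapping every punctuation character to a space.
TABLE = str.maketrans(PUNCT, " " * len(PUNCT))


def normalize(text):
    """Lowercase the text and blank out punctuation (hypothetical helper)."""
    return text.lower().translate(TABLE)


print(normalize("To be, or not to be: that is the question.").split())
```

Whether the apostrophe should be stripped as well is a judgment call; stripping it would split contractions into fragments that show up as separate words in the counts.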

3. "Character Appearance Statistics in Romance of the Three Kingdoms" Example Explanation (1)

  • Chinese text word segmentation
  • Use dictionaries to express word frequencies
# CalThreeKingdomsV1.py

import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Segment the Chinese text into a list of words
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens (punctuation and function words)
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
# Sort the (word, count) pairs by count, largest first
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/mh/krrg51957cqgl0rhgnwyylvc0000gn/T/jieba.cache
Loading model cost 1.030 seconds.
Prefix dict has been built succesfully.


Cao Cao          953
Kong Ming        836
General          772
But say          656
Xuande           585
Guan Gong        510
Prime minister   491
Two people       469
Must not         440
Jingzhou         425
Xuande said      390
Kong Ming said   390
Cannot           384
Such             378
Zhang Fei        358
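
The counting loop and the sort can be condensed with collections.Counter from the standard library. A sketch on a hand-made token list (the tokens only stand in for what jieba.lcut returns and are not the novel's real segmentation):

```python
from collections import Counter

# Hypothetical pre-segmented tokens standing in for jieba.lcut output.
# 曹操 = Cao Cao, 孔明 = Kong Ming, 将军 = general, 曰 = "said"
tokens = ["曹操", "曰", "孔明", "曹操", "，", "将军", "孔明", "曹操"]

# Skip single-character tokens, as the loop above does.
counts = Counter(t for t in tokens if len(t) > 1)

# most_common(n) returns the n highest (word, count) pairs, largest first.
for word, count in counts.most_common(3):
    print("{0:<10}{1:>5}".format(word, count))
```

Even in this toy list a non-name word ("general") makes the ranking, which is the same effect visible in the real output above.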

4. "Character Appearance Statistics in Romance of the Three Kingdoms" Example Explanation (2)

4.1 Character Appearance Statistics in Romance of the Three Kingdoms

Tie the word frequencies to characters so that the analysis answers the actual question.

Word frequency statistics --> character statistics

# CalThreeKingdomsV2.py
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# The string literals must be Chinese, since jieba.lcut returns Chinese words;
# the ones used here are assumed from the English glosses given in the comments.
# Frequent non-name words: general, "but say", Jingzhou, two people, must not, cannot, such
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":  # Zhuge Liang, "Kong Ming said"
        rword = "孔明"  # Kong Ming
    elif word == "关公" or word == "云长":  # Guan Gong, Yunchang
        rword = "关羽"  # Guan Yu
    elif word == "玄德" or word == "玄德曰":  # Xuande, "Xuande said"
        rword = "刘备"  # Liu Bei
    elif word == "孟德" or word == "丞相":  # Mengde, prime minister
        rword = "曹操"  # Cao Cao
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    # pop with a default avoids a KeyError if a word never occurred
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Cao Cao         1451
Kong Ming       1383
Liu Bei         1252
Guan Yu          784
Zhang Fei        358
Negotiation      344
How to           338
Prince           331
Sergeant         317
Lü Bu            300
  • Chinese text word segmentation
  • Use dictionaries to express word frequencies
  • Extend the program to solve the problem
  • Further optimization based on results
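
As more aliases and non-name words turn up, the chain of elif branches grows with every round of optimization. One possible alternative is to keep the aliases in a dictionary, so each new finding becomes a data change instead of a new branch. A sketch, assuming the Chinese tokens noted in the comments (the names ALIASES, EXCLUDES and count_characters are invented for this illustration, and the token list is hand-made):

```python
# Hypothetical alias table: variant token -> canonical character name.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",  # Zhuge Liang, "Kong Ming said" -> Kong Ming
    "关公": "关羽", "云长": "关羽",      # Guan Gong, Yunchang -> Guan Yu
    "玄德": "刘备", "玄德曰": "刘备",    # Xuande, "Xuande said" -> Liu Bei
    "孟德": "曹操", "丞相": "曹操",      # Mengde, prime minister -> Cao Cao
}
# Frequent words that are not character names: general, "but say"
EXCLUDES = {"将军", "却说"}


def count_characters(tokens):
    """Count appearances per character, merging aliases and skipping noise."""
    counts = {}
    for token in tokens:
        if len(token) == 1:
            continue
        # Fall back to the token itself when it has no alias entry
        name = ALIASES.get(token, token)
        if name in EXCLUDES:
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hand-made tokens standing in for jieba.lcut output.
sample = ["玄德", "曰", "孔明曰", "将军", "刘备", "诸葛亮", "丞相"]
print(count_characters(sample))
```

Filtering excluded words during the loop also removes the need to delete them from the dictionary afterwards.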

The top 20 characters by number of appearances in Romance of the Three Kingdoms: Cao Cao, Kong Ming, Liu Bei, Guan Yu, Zhang Fei, Lü Bu, Zhao Yun, Sun Quan, Sima Yi, Zhou Yu, Yuan Shao, Ma Chao, Wei Yan, Huang Zhong, Jiang Wei, Ma Dai, Pang De, Meng Huo, Liu Biao, Xiahou Dun

5. Extending "Text Word Frequency Statistics"

5.1 Extending to other applications

  • Dream of the Red Chamber, Journey to the West, Water Margin...
  • Government work reports, scientific research papers, news reports...
  • Going further? Word clouds will come later...
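
To reuse the approach on other English texts, the steps can be wrapped in a single function. A sketch under stated assumptions: the function name top_words, its parameters, and the sample sentence are all invented here, and the exclusion set plays the same role as excludes in the Three Kingdoms example.

```python
import string
from collections import Counter


def top_words(text, n=10, excludes=frozenset()):
    """Return the n most frequent words in text, skipping those in excludes."""
    # Blank out all ASCII punctuation, then split on whitespace
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    counts = Counter(w for w in words if w not in excludes)
    return counts.most_common(n)


# Hypothetical stand-in for a report or news article.
report = "The budget grew. The deficit shrank, and the budget held."
print(top_words(report, n=2, excludes={"the", "and"}))
```

For Chinese texts the split step would be replaced by a segmenter such as jieba.lcut, as in the examples above.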

Posted by mkoga on Mon, 08 Jun 2020 17:09:48 -0700