Lesson 1: Word clouds

Keywords: Python NLP

1. Install jieba (word segmentation) and wordcloud

pip3 install jieba wordcloud (on some systems the command is pip rather than pip3)
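To confirm that the installation worked, the packages can be looked up from Python itself. This is only a sketch: the helper name is_installed is made up for this example, and the list includes the other packages that the code later in this lesson imports.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Packages used in this lesson, listed by import name.
# Note that the Pillow package is imported as PIL.
for name in ['jieba', 'wordcloud', 'matplotlib', 'numpy', 'PIL']:
    print(name, 'ok' if is_installed(name) else 'missing: install with pip')
```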

2. Open the source text for the word cloud and bind it to a name

Use with open(...) as f
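A minimal sketch of this step, assuming a UTF-8 text file. The file name sample.txt and its contents are invented so that the snippet runs on its own; they are not the files used in this lesson.

```python
import os
import tempfile

# Create a small UTF-8 file so the example is self-contained.
# 'sample.txt' is a placeholder name.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('今天天气很好\n')

# 'r' opens the file read-only. The file is closed automatically
# when the with-block ends, even if an error occurs inside it.
with open(path, 'r', encoding='utf-8') as f:
    start = f.tell()   # position before reading: the start of the file
    text = f.read()    # the whole file as one string

print(start, repr(text))
```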

3. Word segmentation

Load a custom dictionary (jieba.load_userdict), then cut the text into words (seg_list)
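As a hedged sketch of what the custom dictionary does: jieba's user dictionary is a UTF-8 text file with one entry per line, the word optionally followed by a frequency and a part-of-speech tag. The helper names write_user_dict and segment, and the sample words, are invented for this illustration. segment imports jieba only when it is called, so jieba must be installed before using it.

```python
import os
import tempfile

def write_user_dict(path, words):
    """Write one word per line, the simplest form of a jieba user dictionary."""
    with open(path, 'w', encoding='utf-8') as f:
        for word in words:
            f.write(word + '\n')

def segment(text, user_dict_path=None):
    """Segment text in accurate mode, optionally loading a custom dictionary first."""
    import jieba  # imported here so write_user_dict works without jieba
    if user_dict_path is not None:
        jieba.load_userdict(user_dict_path)
    return jieba.lcut(text, cut_all=False)  # lcut returns a list, cut a generator

# Hypothetical dictionary entries: terms we want kept as single words.
dict_path = os.path.join(tempfile.mkdtemp(), 'userdict.txt')
write_user_dict(dict_path, ['十四五规划', '高质量发展'])
```

Calling segment(some_text, dict_path) should then tend to keep the listed terms whole instead of splitting them into shorter words.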

4. Count word frequencies

Define an empty dictionary and fill it with a loop
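The counting loop can also be written with collections.Counter from the standard library. A small sketch, using an invented list of words in place of jieba's output:

```python
from collections import Counter

# Stand-in for the words produced by jieba.cut(); the words are invented.
seg_list = ['发展', '经济', '发展', '的', '创新', '发展', '经济']

# Manual version: an empty dictionary filled by a loop.
tf = {}
for seg in seg_list:
    tf[seg] = tf.get(seg, 0) + 1   # get() returns 0 the first time a word appears

# Counter does the same counting in one call.
counts = Counter(seg_list)

print(tf)
print(counts.most_common(2))
```

Both versions give the same counts; Counter additionally offers most_common() for a quick look at the top words.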

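5. Filter with stop words

Put the words obtained from the text into a list.

Open the stop-word text file and bind it to a name.

Loop over the list and test each word (conditions: word frequency < 20 / single-character word, checked with len / word appears in the stop-word list).

Delete the unwanted words with name.pop(), where name is the frequency dictionary.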
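A sketch of the filtering step on made-up data. It assumes the stop-word file holds one word per line and loads it into a set, so that the in test compares whole words; testing against the raw file text would match substrings instead. The threshold is lowered from 20 to 2 only so that this tiny sample has something left to keep.

```python
# Invented frequencies standing in for the dictionary built in step 4.
tf = {'发展': 3, '经济': 2, '的': 5, '创新': 1, '我们': 4}

# Assumed file layout: one stop word per line.
stop_text = '的\n我们\n了\n'
stopwords = set(stop_text.split())

MIN_COUNT = 2  # the lesson uses 20; 2 suits this small sample

# Iterate over a copy of the keys: deleting from a dictionary
# while looping over it directly raises RuntimeError.
for word in list(tf):
    if tf[word] < MIN_COUNT or len(word) < 2 or word in stopwords:
        tf.pop(word)

print(tf)
```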

******The steps above yield the segmented words and their frequencies******

 

6. Check and change the working directory

import os; os.getcwd() returns the current working directory; os.chdir(path) changes it
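A short sketch of both calls. The temporary directory stands in for your own project folder, which is an assumption made so the example can run anywhere.

```python
import os
import tempfile

original = os.getcwd()          # the directory Python currently treats as "here"
print('before:', original)

# A temporary directory stands in for the folder that holds
# your text, stop-word, and image files.
target = tempfile.mkdtemp()
os.chdir(target)                # relative paths now resolve inside target
changed = os.getcwd()
print('after: ', changed)

os.chdir(original)              # switch back when done
```

After os.chdir, a relative name such as 'China145.txt' is looked up in the new directory, which is why checking os.getcwd() first helps when a file "cannot be found".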

7. Build the word cloud

name = WordCloud(options for the figure).generate_from_frequencies(word-frequency dictionary)

8. Set the font

font = r'path' (the location of the font file on your computer). Chinese text needs a font that contains Chinese characters.

9. Display the word cloud

plt.imshow(wc); plt.axis('off') hides the coordinate axes; plt.show() displays the picture

******The steps above produce a rectangular word cloud******

10. Use a shape (mask)

from PIL import Image

import numpy as np

Then add mask=... to the WordCloud options from step 7
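To illustrate how a mask works, here is a hedged sketch that builds one with NumPy instead of loading a picture. In wordcloud, pure-white entries (255) of the mask are treated as background and left empty, and other values can be drawn on. The function name circle_mask and the sizes are invented for this example.

```python
import numpy as np

def circle_mask(size=400):
    """Return a size x size uint8 array: 0 inside a centered circle, 255 outside."""
    y, x = np.ogrid[:size, :size]          # row and column index grids
    center = (size - 1) / 2
    radius = size * 0.45
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return np.where(outside, 255, 0).astype(np.uint8)

mask = circle_mask(400)
print(mask.shape, mask[0, 0], mask[200, 200])
```

An array like this can be passed as mask=mask in place of np.array(Image.open(...)). Because only pure white counts as background, a picture with near-white pixels (which lossy JPEG compression may produce) might not be masked as cleanly as expected.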

 

The code is as follows:

import jieba
with open('China145.txt', 'r', encoding='utf-8') as f:  # 'r' opens the file read-only, with the pointer at the start of the file
    renmin = f.read()  # renmin is just a variable name of your own choosing

# Word segmentation
jieba.load_userdict('China145cut.txt')  # load a custom dictionary so that the terms listed in it are kept as whole words
seg_list = jieba.cut(renmin, cut_all=False)  # cut_all=False selects accurate mode

# Count word frequencies
tf = {}  # tf (term frequency) maps each word to its count
for seg in seg_list:
    if seg in tf:
        tf[seg] += 1  # tf[seg] is the value stored under the key seg
    else:
        tf[seg] = 1

# Filter with stop words
ci = list(tf.keys())  # copy the keys into a list called ci, so that tf can be changed while looping
with open('chinesestopwords.txt', 'r', encoding='utf-8') as ft:
    stopword = set(ft.read().split())  # a set of stop words from your own txt file (assumes the words are separated by newlines or spaces)

for seg in ci:
    if tf[seg] < 20 or len(seg) < 2 or seg in stopword or "-" in seg:
        tf.pop(seg)  # pop deletes the given key
        
print(tf)

import os
print(os.getcwd())
from wordcloud import WordCloud
import matplotlib.pyplot as plt

#Add shape
from PIL import Image
import numpy as np  # np is the conventional short alias; NumPy is an array library for Python

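mask = np.array(Image.open('heart.jpg'))  # pure-white areas of the picture are left empty
# Add font
font = r'c:\Windows\Fonts\simfang.ttf'
# generate_from_frequencies takes a word-to-count dictionary, whereas generate takes raw text and counts the words itself.
# When a mask is given, width and height are ignored and the mask's shape is used.
wc = WordCloud(background_color='white', mask=mask, font_path=font, width=800, height=600).generate_from_frequencies(tf)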

plt.imshow(wc)
plt.axis('off')
plt.show()  # display the image
wc.to_file('wc.jpg')  # save the word cloud as a JPG file

Relevant knowledge:

[text cannot be parsed] When creating a new text file, use Save As and select UTF-8 as the encoding.

[encoding='utf-8'] Add this argument to open() if the file cannot be read or shows garbled characters.
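A small sketch of why the encoding has to match. A file on disk is just bytes, and the same bytes decode correctly only with the encoding that produced them. Without an explicit encoding argument, open() may fall back to a system default that is not UTF-8, which is one common cause of this problem on Windows.

```python
text = '词云'

# Encoding turns the string into bytes; UTF-8 uses three bytes
# for each of these two characters.
data = text.encode('utf-8')
print(len(data))

# Decoding with the same encoding recovers the text.
same = data.decode('utf-8')

# Decoding with a different encoding fails or yields garbage;
# ASCII cannot represent these bytes at all.
try:
    data.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

print(same, failed)
```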

[os module] provides methods to process files and directories

os.getcwd() returns the current working directory    

os.chdir(path) changes the current working directory

[jieba word segmentation]

Related links: Learning jieba ("stutter") word segmentation in Python, Liu Shuai, cnblogs

[with open as] read and write files

r means open in read-only mode, with the pointer at the beginning of the file

Related links: Reading and writing files in Python with with open() as, xrinosvip's blog, CSDN

[imshow] plt.imshow() draws an array as an image. It is often used for heat maps, a common data-analysis tool that shows differences in the data through color and brightness; here it displays the word cloud.

Related links: plt.imshow(), CSDN blog

[PIL] image-processing module (installed as the Pillow package)

Related links: https://www.jb51.net/article/184195.htm

After-class practice

Fang Si-Qi's First Love Paradise is a novel about Fang Siqi, a girl who loves literature, who is sexually assaulted by her teacher Li Guohua and eventually suffers a mental collapse. According to the word cloud (excluding the protagonist's name), the main keywords of the novel are: teacher, like, no, sister, don't, and so on. To Siqi, Li Guohua was at first a respectable Chinese teacher with deep attainments in literature. Under the teacher's inducement, however, Siqi could not resist and suffered greatly; she had to force herself to "like" the teacher for a moment of relief. Sister is Siqi's neighbor, a young woman who also loves literature but has long lived with domestic violence; her story forms the novel's second thread. Sister is Siqi's comfort, but her experience also imperceptibly affects Siqi.

 

Posted by TutorMe on Fri, 17 Sep 2021 16:10:35 -0700