[talking about python crawler 2] etree method based on lxml library combined with xpath method -- crawling the contents of the ranking list and generating the word cloud map of the ranking list

Keywords: Python html5 crawler Data Analysis Data Mining

Hello, everyone. I'm a studious junior brother. Today, I will continue to explain the second method I wrote: etree method based on lxml library combined with xpath method - crawling the contents of the ranking list and generating the word cloud map.

The learning experience is mainly divided into three lectures:

1. Crawler based on regular expression - crawling the contents of the ranking list

2. etree method based on lxml library combined with xpath method -- crawling the contents of the list and generating the cloud picture of the list words

3. Call the crawler method of the ranking list based on the interface and save the crawled content to the csv file

etree method based on lxml library combined with xpath method -- crawling the contents of the ranking list and generating the word cloud map of the ranking list

Idea: first print the text content of the web page through the requests library, and then convert the printed text content into html through the etree method in the lxml library. After converting to html, we can use xpath to locate the text content of the tag, that is, the content we want to crawl. Then import wordcloud word cloud Library (generate word cloud), jieba sub word library (divide the generated word cloud into words to show better), matplotlib Library (draw word cloud image module), and finally import the word cloud background image you want to generate to complete this crawling task.

Step 1: print out the web page content through the requests library; Then use etree to transform the text content into html web page format

import requests
#Import etree method function from lxml Library
from lxml import etree

csdn_python_url='https://blog.csdn.net/nav/python'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
python_yemian=requests.get(url=csdn_python_url,headers=headers)
python_yemian.encoding="utf-8"
#lxml is introduced here, and the text text crawled down can be used as html again with etree.HTML method
html = etree.HTML(python_yemian.text)

'''
there html Has been transformed by us into html Web page format
 Can proceed xpath Location


'''

Step 2: xpath locates the label text content

Brief overview of xpath positioning methods:

Crawl the content by positioning

'''
list_rebang Is a list
 Note that what we need here is the text content of the label, so finally locate the position
 We need to add one text()

'''
list_rebang=html.xpath('//ul[@id="feedlist_id"]/li/div/div/h2/a/text()')
#Convert list to string
str_rebang=''.join(list_rebang)
#Replace the blanks
str_rebang=str_rebang.replace(' ','')
#print(str_rebang)

Note: I forgot to add attribute to the positioning tag in the above figure

Details: through observation, we find that the first few li tags and div tags in the above label have many repetitions, so how should we locate them?

Answer: if you want to locate the first li tag, li[1]; If you want to locate the first div tag in the li tag, click / li/div[1];

(the positioning here starts from 1, not 0. Note)

Step 3: generate word cloud

#jieba word segmentation divides sentences into words separated by spaces
cut_text = ' '.join(jieba.cut(str_rebang))
print(cut_text)

#Font library location, which font is used when displaying word cloud
font_path = r'D:\fang_zheng_zitiku\FZSSJW.TTF'

#Location of word cloud background map
background_jpg_path = r'D:\python_pachong_csdn_rebang\Word cloud background picture\Word cloud background figure 2.jpg'
#Generate background map
background_image = np.array(Image.open(background_jpg_path))

#To form a word cloud, mask is the parameter size to obtain the background image
wordcloud =WordCloud(font_path,background_color="white",mask=background_image).generate(cut_text)




# Generate color values
image_colors = ImageColorGenerator(background_image)


#Draw word cloud
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
#Close axis
plt.axis("off")
#Drawing display
plt.show()

Full code:

import requests
from lxml import etree
#Drawing image module
import matplotlib.pyplot as plt
#jieba plays the role of word segmentation
import jieba
#When you download wordcloud, you automatically download pIL and numpy
from PIL import Image

import numpy as np

from wordcloud import WordCloud, ImageColorGenerator

csdn_python_url='https://blog.csdn.net/nav/python'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
python_yemian=requests.get(url=csdn_python_url,headers=headers)
python_yemian.encoding="utf-8"
#lxml is introduced here, and etree can be used as html to crawl down the text text
html = etree.HTML(python_yemian.text)
#name
#list_ Rebang text is followed by a list
list_rebang=html.xpath('//ul[@id="feedlist_id"]/li/div/div/h2/a/text()')
#Convert list to string
str_rebang=''.join(list_rebang)
#Replace spaces with strings
str_rebang=str_rebang.replace(' ','')
#print(str_rebang)

#jieba word segmentation divides sentences into words separated by spaces
cut_text = ' '.join(jieba.cut(str_rebang))
print(cut_text)

#Font library location
font_path = r'D:\fang_zheng_zitiku\FZSSJW.TTF'

#Location of word cloud background map
background_jpg_path = r'D:\python_pachong_csdn_rebang\Word cloud background picture\Word cloud background figure 2.jpg'

background_image = np.array(Image.open(background_jpg_path))

#Constituent word cloud
wordcloud =WordCloud(font_path,background_color="white",mask=background_image).generate(cut_text)
#Draw word cloud
#interpolation="bilinear" I can't understand bilinear interpolation...


# Generate color values
image_colors = ImageColorGenerator(background_image)



plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
#Close axis
plt.axis("off")
#Drawing display
plt.show()

design sketch:

It's not easy for newcomers to create. I think it's good to watch the official. Give me a praise, mmda!!!

Reprint indicate the source!

Posted by liquidchild_au on Fri, 19 Nov 2021 21:50:45 -0800

Programmer Group

[talking about python crawler 2] etree method based on lxml library combined with xpath method -- crawling the contents of the ranking list and generating the word cloud map of the ranking list

Hot Keywords