1. Crawling Douban data with requests + lxml (no login)
1. First, open Douban and find the URL of the movie whose short reviews you want to crawl, for example:
2. Open the browser's developer tools. On most computers press F12 directly (keyboards differ; on a ThinkPad it is Fn+F12). Go to the Network tab -> click the first request -> open the Headers panel on the right -> scroll down to the request headers and copy the User-Agent.
Note: the User-Agent carries browser and hardware/device information, so sending it makes the request look like it comes from a real browser; without it the server responds with error 403 (access to the requested file or directory on the server is denied).
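To see the effect, here is a minimal sketch (not from the original post) comparing a bare request with one that carries a browser-like User-Agent; the exact status code Douban returns for the bare request may vary, the post reports 403:

import requests as rq

url = 'https://movie.douban.com/subject/1292001/'

# Bare request with the default requests User-Agent: usually rejected
print(rq.get(url).status_code)

# Request with a browser-like User-Agent copied from the developer tools: normally 200
headers = {'User-Agent': 'Mozilla/5.0'}  # replace with your own full UA string
print(rq.get(url, headers=headers).status_code)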
3. Code part
# Import packages
import requests as rq
from lxml import etree

# URL + header information
url = 'https://movie.douban.com/subject/1292001/'
headers = {'User-Agent': '****'}
4. Get the web page data (the HTML source of the page)
Note: the get() method is used here because no login is required
data = rq.get(url,headers=headers).text
Output result of data:
5. Parse the page once its data has been fetched
s = etree.HTML(data)
6. Next, extract the element information from the page, such as the movie name, director, starring actors, and duration. The XPath of each element can be obtained manually, as shown in the following figure:
Right-click the target tag -> Copy -> Copy XPath
7. Code part
# The default return is the list format
film = s.xpath('/html/body/div[3]/div[1]/h1/span[1]/text()')
director = s.xpath('/html/body/div[3]/div[1]/div[3]/div[1]/div[1]/div[1]/div[1]/div[2]/span[1]/span[2]/a/text()')
Starring = s.xpath('/html/body/div[3]/div[1]/div[3]/div[1]/div[1]/div[1]/div[1]/div[2]/span[3]/span[2]/a/text()')
duration = s.xpath('/html/body/div[3]/div[1]/div[3]/div[1]/div[1]/div[1]/div[1]/div[2]/span[13]/text()')
print("Movie name:", film[0])
print("director:", director[0])
print("Starring:", Starring[0])
print("duration:", duration[0])
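As an aside (not part of the original post), the absolute XPath copied from DevTools breaks as soon as the page layout changes. A more robust sketch, assuming Douban still exposes its RDFa attributes such as v:itemreviewed and v:runtime, matches elements by attribute instead of by position:

# Hypothetical attribute-based alternative to the copied absolute XPath
film = s.xpath('//span[@property="v:itemreviewed"]/text()')
director = s.xpath('//a[@rel="v:directedBy"]/text()')
Starring = s.xpath('//a[@rel="v:starring"]/text()')
duration = s.xpath('//span[@property="v:runtime"]/text()')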
2. Crawling short movie reviews after a simulated login to Douban
1. Simulated login to Douban
(1) There are several ways to log in to Douban; here we log in with a password.
First we need the URL used for password login. This is not the address shown in the browser, but the URL of the POST request seen in the Network panel. Enter a wrong account and password on purpose so the request shows up and the URL can be captured.
Click the POST request; this is the real request URL.
Next, prepare the header information: scroll down through the headers and copy the Cookie and User-Agent:
Then submit the user name and password and look at the form data that is sent:
Code part:
s = rq.session()

def login_douban():
    # Login URL
    login_url = 'https://accounts.douban.com/j/mobile/login/basic'
    # Request header information
    headers = {'User-Agent': '****', 'Cookie': '****'}
    # Pass user name and password
    data = {
        'ck': '',
        'name': 'user name',
        'password': 'password',
        'remember': 'false',
        'ticket': ''
    }
    try:
        r = s.post(url=login_url, headers=headers, data=data)
        r.raise_for_status()
    except:
        print("Login request failed")
        return 0
    # Print request results
    print(r.text)
    return 1
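A minimal usage sketch (not in the original post): because s is a shared session, logging in once stores the cookies in it, and every later request made through s is sent as the logged-in user:

# Hypothetical driver: log in once, then reuse the session `s` for crawling
if login_douban():
    print('Logged in, session cookies:', list(s.cookies.keys()))
else:
    print('Login failed, check the account, password and headers')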
2. Crawling a page of data after login
1. Before crawling, get the address of the page you want, for example the first page of short reviews: click into the short comments section, then locate the URL as shown in the figure below. Find the User-Agent the same way as before.
The code is as follows:
# (This snippet and the ones that follow sit inside the crawling function in the
# full script, hence the return statements.)
comment_url = '****'
# Request header
headers = {'user-agent': 'Mozilla/5.0'}
try:
    r = s.get(comment_url, headers=headers)
    r.raise_for_status()
except:
    print('Crawl request failed')
    return 0
2. After the crawl request succeeds, a regular expression is used to extract the review content
Note: the re library is mainly used for string matching
# re must be imported at the top of the script: import re
# (.*) is greedy, which is fine here because each short-review span sits on its own line
comments = re.findall('<span class="short">(.*)</span>', r.text)
if not comments:
    return 0
3. After getting the data, write it to a text file (either one line at a time, or everything at once)
with open(COMMENTS_FILE_PATH, 'a+', encoding=r.encoding) as file:
    file.writelines('\n'.join(comments))
4. This only fetches one page of data. To get more, page turning must be added, and the clue is in the URL:
https://movie.douban.com/subj...
start is the offset into the comment list and limit is the number of comments shown per page. Click to the next page and you will see that start begins at 0 and grows by 20 each time, moving one page further per step. So batch crawling only requires changing the start value in the code, as sketched below. Note, however, that Douban limits each account to at most 500 comments.
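Here is a minimal page-turning sketch (not the original script): it assumes a comments URL with a start/limit query string as captured in step 1, and reuses the session, regex and file path from above; the base_url pattern and MAX_PAGES are placeholders to adapt:

import re
import time

# Hypothetical batch crawl: increase `start` by 20 per page.
# Douban caps an account at roughly 500 comments, i.e. about 25 pages.
base_url = 'https://movie.douban.com/subject/1292001/comments?start={}&limit=20'  # assumed URL pattern
MAX_PAGES = 25

for page in range(MAX_PAGES):
    comment_url = base_url.format(page * 20)
    r = s.get(comment_url, headers={'user-agent': 'Mozilla/5.0'})
    if r.status_code != 200:
        break
    comments = re.findall('<span class="short">(.*)</span>', r.text)
    if not comments:
        break
    with open(COMMENTS_FILE_PATH, 'a+', encoding=r.encoding) as file:
        file.writelines('\n'.join(comments) + '\n')
    time.sleep(1)  # be polite between requests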
5. Final data obtained:
6. Once all the data has been segmented into words, the word cloud can be created~~~
Note: encoding='UTF-8' should be passed to open() to specify the file encoding; otherwise a UnicodeDecodeError will occur
def cut_word():
    with open(COMMENTS_FILE_PATH, encoding='UTF-8') as file:
        comment_text = file.read()
    wordlist = jieba.cut(comment_text, cut_all=True)
    wl = " ".join(wordlist)
    print(wl)
    return wl
7. Last step ~ generate word cloud
Note: extra libraries are needed for word segmentation and word cloud generation, such as jieba, PIL, etc.; import them accordingly
def create_word_cloud():
    # Set the word cloud shape picture
    wc_mask = np.array(Image.open(WC_MASK_IMG))
    # List of stop words for data cleaning
    stop_words = ['namely', 'no', 'however', 'still', 'just', 'such', 'this',
                  'One', 'what', 'film', 'No,', 'ha-ha']
    # Set the word cloud configuration, such as font, background color, shape and size
    wc = WordCloud(background_color='red', max_words=255, mask=wc_mask, scale=4,
                   max_font_size=255, random_state=42, stopwords=stop_words,
                   font_path=WC_FONT_PATH)
    # Generate the word cloud
    wc.generate(cut_word())
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.figure()
    plt.show()
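For completeness, a sketch of the imports and constants that cut_word() and create_word_cloud() rely on; the file paths are placeholders to replace with your own comment file, mask image and a font that supports Chinese:

# Hypothetical setup for the word-cloud part; paths below are placeholders
import jieba
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

COMMENTS_FILE_PATH = 'comments.txt'   # file the crawler wrote the reviews to
WC_MASK_IMG = 'mask.png'              # image that defines the cloud shape
WC_FONT_PATH = 'simhei.ttf'           # font able to render Chinese glyphs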
8. The resulting word cloud is shown below (I chose a big red background, ugly...)
Specific code: https://github.com/shihongyan...