A basic crawler anyone can learn: use requests and regular expressions to crawl Douban's Top 250 movie data!

Keywords: Python JSON network Mac OS X

This crawler fetches the score, poster, short review, and other data for each of Douban's Top 250 movies.

  This project is the most basic and simplest example of a web crawler;
later posts will cover more advanced, automated crawlers built on crawler frameworks.
  The workflow of this project: use the requests library to fetch the HTML, then use regular expressions to parse out the required data.

Without further ado, straight to the code! (The specifics are explained in the comments alongside the code.)

1. Load the packages: requests is the HTTP request library, re is the regular expression package, json serializes the dictionaries later, and time is used to pause between requests.
# Request library: requests; parsing tool: regular expressions
import requests
import re
import json
import time
2. Get the response for a url with the requests library and return the HTML text.
def get_one_page(url):
    # The headers can be copied from the browser (right-click > Inspect > Network tab > request headers)
    headers={
        'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E5216a QQ/7.5.5.426 V1_IPH_SQ_7.5.5_1_APP_A Pixel/1080 Core/UIWebView Device/Apple(iPhone 8Plus) NetType/WIFI QBWebViewType/1'
        }
    response=requests.get(url,headers=headers)
    if response.status_code==200:  # Only a 200 status code means the request succeeded
        return response.text
    return None
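In practice a request can hang or fail outright. A hedged variant of get_one_page (a sketch, not part of the original post; the 10-second timeout and the exception handling are my assumptions) might look like:

```python
import requests

def get_one_page_safe(url):
    # Same idea as get_one_page, plus a timeout and error handling
    # (sketch; the 10-second timeout is an assumed value)
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass  # covers timeouts, DNS failures, malformed URLs, etc.
    return None
```

With this variant a single bad request returns None instead of crashing the whole ten-page loop.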
3. Use regular expressions to extract the desired data from the HTML.
def parse_one_page(html):
    # re.compile turns the pattern string into a pattern object so it can be reused
    pattern=re.compile(r'<li>.*?<em\sclass.*?>(.*?)</em>.*?<img.*? src="(.*?)".*?title">(.*?)<.*?<p class="">(.*?)</p>.*?rating_num.*?>(.*?)<.*?<span>(.*?)</span>.*?inq">(.*?)<.*?</li>',re.S)
    items=re.findall(pattern,html)
    # findall returns a list of tuples; turn each tuple into a dictionary
    
    for item in items:
        yield{'index':item[0],  # Movie ranking
              'image':item[1],  # Movie poster
              'title':item[2],  # Movie title
              'actor':item[3],  # Director and leading actors
              'score':item[4],  # Rating
              'people_num':item[5],  # Number of ratings
              'evaluate':item[6]     # Short review (one-line quote)
                }
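To see the pattern in action without hitting the network, here is the same regular expression run against a minimal hand-made HTML snippet shaped like one entry on the page (the movie data below is illustrative, not scraped):

```python
import re

# The same pattern as above, written as a raw string
pattern = re.compile(r'<li>.*?<em\sclass.*?>(.*?)</em>.*?<img.*? src="(.*?)"'
                     r'.*?title">(.*?)<.*?<p class="">(.*?)</p>.*?rating_num'
                     r'.*?>(.*?)<.*?<span>(.*?)</span>.*?inq">(.*?)<.*?</li>',
                     re.S)

# Minimal fake snippet mimicking one <li> entry on the Douban page
sample = '''
<li>
  <em class="">1</em>
  <img alt="poster" src="https://example.com/poster.jpg">
  <span class="title">The Shawshank Redemption</span>
  <p class="">Director: Frank Darabont ...</p>
  <span class="rating_num">9.7</span>
  <span>2000000 ratings</span>
  <span class="inq">Hope is a good thing.</span>
</li>
'''

items = re.findall(pattern, sample)
# Each match is a 7-tuple: (rank, poster url, title, director/actors,
# score, number of ratings, one-line review)
```

re.S makes `.` match newlines too, which is essential because each movie's fields span many lines of HTML.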
4. Store the obtained data in a txt file.
def write_to_file(content):
    # Create or open result.txt and write data in append mode
    with open('result.txt','a',encoding='utf-8') as f:
        print(json.dumps(content,ensure_ascii=False))  # json.dumps serializes the dictionary so it can be written to the txt file
        f.write(json.dumps(content,ensure_ascii=False)+'\n')
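The ensure_ascii=False argument matters here: without it, json.dumps escapes every non-ASCII character into \uXXXX sequences, which makes Chinese titles unreadable in the txt file. A quick self-contained check (the sample record is illustrative):

```python
import json

record = {'index': '1', 'title': '肖申克的救赎', 'score': '9.7'}

line = json.dumps(record, ensure_ascii=False)  # keeps Chinese readable
escaped = json.dumps(record)                   # default: \uXXXX escapes

# Both forms round-trip back to the same dictionary via json.loads
```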
5. Change the start value in the url to switch to the next page.
def main(start):
    # Changing start switches pages; each page holds 25 movies, so start goes 0, 25, 50, ...
    url='https://movie.douban.com/top250?start='+str(start)+'&filter='
    html=get_one_page(url)
    for item in parse_one_page(html):
        write_to_file(item)
        
if __name__=='__main__':
    for i in range(10):
        start=i*25
        main(start)
        time.sleep(1)  # Sleep for 1 second so requests are not sent too fast and flagged by the site
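The paging logic above can be checked in isolation: the loop produces start values 0, 25, ..., 225, i.e. ten pages of 25 movies covering the full Top 250.

```python
# Reproduce the start values and urls the main loop generates
starts = [i * 25 for i in range(10)]
urls = ['https://movie.douban.com/top250?start=' + str(s) + '&filter='
        for s in starts]
```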

  All the code in this article can be copied and run directly!

Posted by V-Man on Wed, 04 Dec 2019 02:47:50 -0800