Crawling the Douban Movie Top 250 with 48 lines of Python 3
The code targets Python 3 and uses the following libraries:
requests: fetches page content; request headers can be customized and proxies configured when needed (see its reference documentation).
Beautiful Soup: parses pages and extracts data (see its reference documentation).
PyMySQL: operates a MySQL database from Python 3; Python 2 code used MySQLdb instead. The project is hosted on GitHub.
Install the libraries with pip:
pip install requests
pip install bs4
pip install pymysql
Analyzing the Douban Movie page
Page analysis:
Before crawling, we need to analyze the page to see what data can be extracted from it. The following figure shows the page structure of the Douban Movie Top 250, from which we can extract the rank, movie name, link, poster, score, and quote. I have marked these fields in the figure.
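As a rough illustration of how those fields map to markup, the sketch below runs Beautiful Soup on a small hand-written fragment. The fragment is hypothetical: it only reuses the tag and class names that the crawler code relies on (ol.grid_view, div.pic, span.title, span.rating_num, span.inq) and is not a copy of the real page. The name parse_movies and the example.com URLs are also made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure the crawler expects.
SAMPLE_HTML = """
<ol class="grid_view">
  <li>
    <div class="pic">
      <em>1</em>
      <a href="https://example.com/subject/1/"><img src="https://example.com/p1.jpg"></a>
    </div>
    <span class="title">Sample Movie</span>
    <span class="rating_num">9.7</span>
    <span class="inq">A sample quote.</span>
  </li>
  <li>
    <div class="pic">
      <em>2</em>
      <a href="https://example.com/subject/2/"><img src="https://example.com/p2.jpg"></a>
    </div>
    <span class="title">Another Movie</span>
    <span class="rating_num">9.6</span>
  </li>
</ol>
"""


def parse_movies(html):
    """Extract rank, link, poster, name, score and quote from each list item."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.find("ol", "grid_view").find_all("li"):
        pic_link = item.find("div", "pic").find("a")
        quote = item.find("span", "inq")  # None when the entry has no quote
        movies.append({
            "rank": item.find("em").text,
            "link": pic_link.get("href"),
            "poster": pic_link.find("img").get("src"),
            "name": item.find("span", "title").text,
            "score": item.find("span", "rating_num").text,
            "quote": quote.text if quote else "",
        })
    return movies


movies = parse_movies(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

The second entry has no span.inq element, so its quote falls back to an empty string.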
URL analysis:
By clicking on the pagination, we can see that the format of the URL is as follows: https://movie.douban.com/top2...
num is a multiple of 25. Its minimum is 0 (the first page) and its maximum is 225 (the last page), which bounds the range of pages we crawl. filter is a filter condition; it is not used here.
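The pagination rule can be sketched as a short loop. This is only an illustration of the arithmetic; the names BASE_URL, PAGE_SIZE, TOTAL and page_urls are chosen for this example, and the URL template is the one defined later in the code.

```python
BASE_URL = "https://movie.douban.com/top250?start=%d&filter="
PAGE_SIZE = 25   # movies per page
TOTAL = 250      # size of the ranking


def page_urls():
    """Return one URL per page, with start = 0, 25, ..., 225."""
    return [BASE_URL % start for start in range(0, TOTAL, PAGE_SIZE)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Ten pages of 25 entries cover all 250 movies, so the last start value is 225.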
Code
Import the libraries:
import pymysql
import requests
from bs4 import BeautifulSoup
Define the URL to crawl; %d is the placeholder for an integer:
baseUrl = "https://movie.douban.com/top250?start=%d&filter="
Define the function that crawls the data:
def get_movies(start):
    url = baseUrl % start  # Build the URL for this page
    lists = []  # Movie data for this page
    html = requests.get(url)  # Fetch the page; crawling is not restricted here, so no request headers are forged
    soup = BeautifulSoup(html.content, "html.parser")  # Parse the page content
    items = soup.find("ol", "grid_view").find_all("li")  # All movie entries on the page
    for i in items:
        movie = {}  # Data for one movie
        movie["rank"] = i.find("em").text  # Rank
        movie["link"] = i.find("div", "pic").find("a").get("href")  # Link to the movie's detail page
        movie["poster"] = i.find("div", "pic").find("a").find("img").get("src")  # Poster URL
        movie["name"] = i.find("span", "title").text  # Movie name
        movie["score"] = i.find("span", "rating_num").text  # Score
        movie["quote"] = i.find("span", "inq").text if i.find("span", "inq") else ""  # Some movies have no quote; leave it empty
        lists.append(movie)  # Add to the result list
    return lists
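The function above sends the request with the library's default settings. Should a site start rejecting the default client, a request header can be passed to requests.get. The following is a minimal sketch under that assumption: the User-Agent string, the timeout value and the name fetch_page are illustrative and are not part of the original code.

```python
import requests

# Hypothetical browser-like header; replace it with whatever the target site accepts.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}


def fetch_page(url, timeout=10):
    """Fetch a page with custom headers and fail loudly on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response.content
```

The returned bytes can be handed to BeautifulSoup in the same way as html.content in get_movies.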
Connect to the database and create the data table:
# Specify charset when connecting to the database; otherwise errors may be raised
db = pymysql.connect(host="localhost",user="root",password="root",db="test",charset="utf8mb4")
cursor = db.cursor()# Create a cursor object
cursor.execute("DROP TABLE IF EXISTS movies") #Delete tables if they exist
# SQL statement that creates the table; rank is a reserved word in MySQL 8.0, so it is quoted with backticks
createTab = """CREATE TABLE movies(
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20) NOT NULL,
    `rank` VARCHAR(4) NOT NULL,
    link VARCHAR(50) NOT NULL,
    poster VARCHAR(100) NOT NULL,
    score VARCHAR(4) NOT NULL,
    quote VARCHAR(50)
)"""
cursor.execute(createTab)  # Create the table
......
db.close()# Close the database
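The table uses fairly tight VARCHAR limits, and MySQL in strict SQL mode rejects values that exceed them. One possible safeguard is to check each record against the limits before inserting it. The sketch below only mirrors the column sizes from the CREATE TABLE statement; COLUMN_LIMITS and oversized_fields are names made up for this example.

```python
# Column sizes copied from the CREATE TABLE statement.
COLUMN_LIMITS = {
    "name": 20,
    "rank": 4,
    "link": 50,
    "poster": 100,
    "score": 4,
    "quote": 50,
}


def oversized_fields(movie):
    """Return {field: length} for every value longer than its column allows."""
    return {
        field: len(movie[field])
        for field, limit in COLUMN_LIMITS.items()
        if len(movie[field]) > limit
    }


sample = {
    "name": "x" * 21,  # one character too long for VARCHAR(20)
    "rank": "1",
    "link": "https://example.com/subject/1/",
    "poster": "https://example.com/p1.jpg",
    "score": "9.7",
    "quote": "",
}
problems = oversized_fields(sample)
print(problems)
```

An empty result means the record fits; otherwise the column can be widened or the value shortened before the insert.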
Store the extracted data in the data table:
lists = get_movies(start)  # Get the extracted data
for i in lists:
    # SQL insert statement; %s is the placeholder for each value
    sql = "INSERT INTO `movies`(`name`,`rank`,`link`,`poster`,`score`,`quote`) VALUES(%s,%s,%s,%s,%s,%s)"
    try:
        cursor.execute(sql, (i["name"], i["rank"], i["link"], i["poster"], i["score"], i["quote"]))
        db.commit()
        print(i["name"] + " is success")
    except pymysql.MySQLError:
        db.rollback()  # Undo the failed insert
start += 25  # Move on to the next page
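The same insert pattern can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module as a stand-in, so it is not the PyMySQL code itself: sqlite3 uses ? placeholders instead of %s, and the table definition is simplified. The sample records and the name save_movies are invented for this example.

```python
import sqlite3


def save_movies(conn, movies):
    """Insert all movies in one transaction and return the stored row count."""
    rows = [
        (m["name"], m["rank"], m["link"], m["poster"], m["score"], m["quote"])
        for m in movies
    ]
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.executemany(
            'INSERT INTO movies (name, "rank", link, poster, score, quote) '
            "VALUES (?, ?, ?, ?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE movies (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    'name TEXT NOT NULL, "rank" TEXT NOT NULL, link TEXT NOT NULL, '
    "poster TEXT NOT NULL, score TEXT NOT NULL, quote TEXT)"
)

sample = [
    {"name": "Sample Movie", "rank": "1", "link": "https://example.com/subject/1/",
     "poster": "https://example.com/p1.jpg", "score": "9.7", "quote": "A sample quote."},
    {"name": "Another Movie", "rank": "2", "link": "https://example.com/subject/2/",
     "poster": "https://example.com/p2.jpg", "score": "9.6", "quote": ""},
]
stored = save_movies(conn, sample)
print(stored)
```

Passing the values separately from the SQL text, as both this sketch and the PyMySQL code do, lets the driver handle quoting, so quotes containing apostrophes do not break the statement.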
Complete code:
import pymysql
import requests
from bs4 import BeautifulSoup
baseUrl = "https://movie.douban.com/top250?start=%d&filter="
def get_movies(start):
    url = baseUrl % start
    lists = []
    html = requests.get(url)
    soup = BeautifulSoup(html.content, "html.parser")
    items = soup.find("ol", "grid_view").find_all("li")
    for i in items:
        movie = {}
        movie["rank"] = i.find("em").text
        movie["link"] = i.find("div","pic").find("a").get("href")
        movie["poster"] = i.find("div","pic").find("a").find('img').get("src")
        movie["name"] = i.find("span", "title").text
        movie["score"] = i.find("span", "rating_num").text
        movie["quote"] = i.find("span", "inq").text if(i.find("span", "inq")) else ""
        lists.append(movie)
    return lists
if __name__ == "__main__":
    db = pymysql.connect(host="localhost",user="root",password="root",db="test",charset="utf8mb4")