Python 3 crawls Douban movies and saves them to a MySQL database

Keywords: Python Database SQL pip

48 lines of Python 3 code to crawl the Douban movie rankings.
The code is based on Python 3 and uses the following libraries:


requests: fetches page content; supports forging request headers, setting up proxies, etc.
Beautiful Soup: parses pages and extracts data.
PyMySQL: operates a MySQL database from Python 3 (in Python 2, MySQLdb is commonly used instead).

Install the libraries with pip:

pip install requests
pip install bs4
pip install pymysql
Analyzing the Douban movie page
Page analysis:
Before crawling, we need to analyze the page to see what data can be extracted from it. From the screenshot we can see the page structure of the Douban Movie Top 250, from which we can extract the rank, movie name, link, poster, score, quote and so on, as marked in the figure.

URL analysis:
By clicking through the pagination, we can see that the URL has the format https://movie.douban.com/top250?start=num&filter=
num is a multiple of 25. The minimum is 0 (the first page) and the maximum is 225 (the last page); this gives us the bound for the crawl. filter is a filter condition and is not used here.
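As a quick sanity check, the start values for all ten pages can be generated with plain string formatting, matching the URL pattern above:

```python
baseUrl = "https://movie.douban.com/top250?start=%d&filter="

# start runs over multiples of 25: 0 (first page) through 225 (last page)
urls = [baseUrl % start for start in range(0, 250, 25)]

print(len(urls))   # 10 pages in total
print(urls[0])     # first page:  start=0
print(urls[-1])    # last page:   start=225
```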

Code
Introducing class libraries:

import pymysql
import requests
from bs4 import BeautifulSoup
Define the crawl URL template; %d is a numeric placeholder:

baseUrl = "https://movie.douban.com/top250?start=%d&filter="
Define the crawling method:

def get_movies(start):
    url = baseUrl % start  # build the URL for this page
    lists = []             # stores the movie data for this page
    html = requests.get(url)  # fetch the page; the site imposes no anti-crawling restriction here, so no fake request headers are set
    soup = BeautifulSoup(html.content, "html.parser")  # parse the page content with Beautiful Soup
    items = soup.find("ol", "grid_view").find_all("li")  # get all the movie entries
    for i in items:
        movie = {}  # temporary storage for one movie
        movie["rank"] = i.find("em").text  # ranking
        movie["link"] = i.find("div", "pic").find("a").get("href")  # detail-page link
        movie["poster"] = i.find("div", "pic").find("a").find("img").get("src")  # poster URL
        movie["name"] = i.find("span", "title").text  # movie name
        movie["score"] = i.find("span", "rating_num").text  # rating
        movie["quote"] = i.find("span", "inq").text if i.find("span", "inq") else ""  # some movies have no quote; leave it blank
        lists.append(movie)  # append to the result list
    return lists
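To see how these selectors behave without hitting the network, here is a minimal sketch that parses a hand-written HTML fragment modeled on the Top 250 list markup. The fragment and its values are made up for illustration; only the class names (grid_view, pic, title, rating_num, inq) mirror the real page:

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup mimicking the Top 250 list structure
html = """
<ol class="grid_view">
  <li>
    <div class="pic"><a href="https://movie.douban.com/subject/1/">
      <img src="https://img.example/poster1.jpg"></a></div>
    <em>1</em>
    <span class="title">Movie A</span>
    <span class="rating_num">9.7</span>
    <span class="inq">A short quote.</span>
  </li>
  <li>
    <div class="pic"><a href="https://movie.douban.com/subject/2/">
      <img src="https://img.example/poster2.jpg"></a></div>
    <em>2</em>
    <span class="title">Movie B</span>
    <span class="rating_num">9.6</span>
  </li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find("ol", "grid_view").find_all("li")
movies = []
for i in items:
    movie = {
        "rank": i.find("em").text,
        "name": i.find("span", "title").text,
        "score": i.find("span", "rating_num").text,
        # some entries have no quote, so fall back to an empty string
        "quote": i.find("span", "inq").text if i.find("span", "inq") else "",
    }
    movies.append(movie)

print(movies[0]["name"])             # Movie A
print(repr(movies[1]["quote"]))      # '' (the second entry has no quote)
```

The second positional argument to find is a shorthand for matching the class attribute, which is why find("ol", "grid_view") selects the list element.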

Connect to the database and create the data table:

When connecting, you need to specify charset, or you may get encoding errors:

db = pymysql.connect(host="localhost", user="root", password="root", db="test", charset="utf8mb4")
cursor = db.cursor()  # create a cursor object
cursor.execute("DROP TABLE IF EXISTS movies")  # drop the table if it already exists

The CREATE TABLE SQL statement (rank is a reserved word in MySQL 8+, so it is quoted with backticks):

createTab = """CREATE TABLE movies(
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20) NOT NULL,
    `rank` VARCHAR(4) NOT NULL,
    link VARCHAR(50) NOT NULL,
    poster VARCHAR(100) NOT NULL,
    score VARCHAR(4) NOT NULL,
    quote VARCHAR(50)
)"""
cursor.execute(createTab)  # create the data table
......
db.close()  # close the database
Store the extracted data in the data table:

lists = get_movies(start)  # get the extracted data

for i in lists:
    # SQL statement for inserting data; %s is a string placeholder
    sql = "INSERT INTO `movies`(`name`,`rank`,`link`,`poster`,`score`,`quote`) VALUES(%s,%s,%s,%s,%s,%s)"
    try:
        cursor.execute(sql, (i["name"], i["rank"], i["link"], i["poster"], i["score"], i["quote"]))
        db.commit()
        print(i["name"] + " is success")
    except Exception:
        db.rollback()
start += 25
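PyMySQL follows the Python DB-API, so the parameterized insert pattern in the loop above can be demonstrated without a MySQL server using the standard-library sqlite3 module, which implements the same interface. Note that sqlite3 uses "?" as the placeholder where PyMySQL uses "%s"; the sample rows below are made up for illustration:

```python
import sqlite3

# In-memory database stands in for the MySQL connection
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("""CREATE TABLE movies(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    score TEXT NOT NULL,
    quote TEXT)""")

lists = [
    {"name": "Movie A", "score": "9.7", "quote": "A short quote."},
    {"name": "Movie B", "score": "9.6", "quote": ""},
]
for i in lists:
    # values are passed separately from the SQL, never string-concatenated
    sql = "INSERT INTO movies(name, score, quote) VALUES(?,?,?)"
    try:
        cursor.execute(sql, (i["name"], i["score"], i["quote"]))
        db.commit()   # commit each row so a later failure rolls back only itself
    except Exception:
        db.rollback() # undo the failed insert and keep going

cursor.execute("SELECT name, score FROM movies ORDER BY id")
rows = cursor.fetchall()
print(rows)  # [('Movie A', '9.7'), ('Movie B', '9.6')]
```

Passing values as a parameter tuple lets the driver handle quoting and escaping, which also protects against SQL injection.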

Complete code (assembled from the pieces above; the page loop runs start from 0 to 225 in steps of 25, as derived in the URL analysis):

import pymysql
import requests
from bs4 import BeautifulSoup

baseUrl = "https://movie.douban.com/top250?start=%d&filter="

def get_movies(start):
    url = baseUrl % start
    lists = []
    html = requests.get(url)
    soup = BeautifulSoup(html.content, "html.parser")
    items = soup.find("ol", "grid_view").find_all("li")
    for i in items:
        movie = {}
        movie["rank"] = i.find("em").text
        movie["link"] = i.find("div", "pic").find("a").get("href")
        movie["poster"] = i.find("div", "pic").find("a").find("img").get("src")
        movie["name"] = i.find("span", "title").text
        movie["score"] = i.find("span", "rating_num").text
        movie["quote"] = i.find("span", "inq").text if i.find("span", "inq") else ""
        lists.append(movie)
    return lists

if __name__ == "__main__":
    db = pymysql.connect(host="localhost", user="root", password="root", db="test", charset="utf8mb4")
    cursor = db.cursor()
    cursor.execute("DROP TABLE IF EXISTS movies")
    createTab = """CREATE TABLE movies(
        id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(20) NOT NULL,
        `rank` VARCHAR(4) NOT NULL,
        link VARCHAR(50) NOT NULL,
        poster VARCHAR(100) NOT NULL,
        score VARCHAR(4) NOT NULL,
        quote VARCHAR(50)
    )"""
    cursor.execute(createTab)
    start = 0
    while start <= 225:
        lists = get_movies(start)
        for i in lists:
            sql = "INSERT INTO `movies`(`name`,`rank`,`link`,`poster`,`score`,`quote`) VALUES(%s,%s,%s,%s,%s,%s)"
            try:
                cursor.execute(sql, (i["name"], i["rank"], i["link"], i["poster"], i["score"], i["quote"]))
                db.commit()
                print(i["name"] + " is success")
            except Exception:
                db.rollback()
        start += 25
    db.close()

Posted by gobbles on Sun, 19 May 2019 10:29:38 -0700