A beginner's Python crawler: 40 lines of code to crawl novels on Douban

Keywords: Python Database encoding Windows

This article was written a long time ago but never published.
It covers the basics of crawling, and I think it is useful practice for beginners. After all, this is what I practiced on when I first learned to crawl: a site that is easy to scrape and is also a classic exercise. Every line of the code is annotated, so little further explanation is needed.
More crawler articles will follow when I have time.

=============================================

Straight to the code:

import requests
from lxml import etree
# Grab the titles and ratings of novels on Douban;

page = 0            # Initial page offset;
lists_book = []     # List of book titles;
lists_grade = []    # List of ratings;
for u in range(20):     # Loop 20 times, crawling one page per iteration, i.e. 20 pages in total;
    basic_url = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=' + str(page) + '&type=T'
    page += 20      # Increase the offset by 20 each iteration to match how the URL changes;

    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
    }
    # Send the request
    response = requests.get(basic_url, headers=headers, timeout=10)     # GET request via requests;
    response.encoding = 'utf-8'     # Set the encoding
    htm = response.text     # Response body as text;

    # Parse the response
    selector = etree.HTML(htm)      # Initialize with etree.HTML
    book_name = selector.xpath('//*[@id="subject_list"]/ul/li/div[2]/h2/a/text()')      # Get the book titles
    grade = selector.xpath('//*[@id="subject_list"]/ul/li/div[2]/div[2]/span[2]/text()')    # Get the ratings;

    # Store the book titles in the lists_book list.
    for i in book_name:
        i = i.strip()       # Strip surrounding whitespace;
        if i:               # Skip entries that are empty after stripping;
            lists_book.append(i)
    # Store the ratings in the lists_grade list.
    for i in grade:
        i = i.strip()       # Strip surrounding whitespace;
        if i:               # Skip entries that are empty after stripping;
            lists_grade.append(i)

print(lists_book)           # Output the crawled list of titles;
print(len(lists_book))      # Length of the list, i.e. how many books were crawled
print(lists_grade)          # Output the list of ratings;
print(len(lists_grade))     # Length of the ratings list; checked against the number of books to catch mismatches;
best_grade = max(lists_grade, key=float)    # Compare ratings numerically, not as strings;
print("Highest score: " + best_grade + "\n" + "Title: " + lists_book[lists_grade.index(best_grade)])

After running, the results are printed to the screen rather than stored in a database; if needed, code can be added to save the crawled results to a database.
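As a sketch of that extra step, the title/rating pairs could be saved with Python's built-in sqlite3 module. The sample data below stands in for the lists_book and lists_grade lists produced by the crawler above, and the file name douban_books.db is an arbitrary choice:

```python
import sqlite3

# Hypothetical sample data standing in for the crawled lists_book / lists_grade lists.
lists_book = ['活着', '白夜行', '红楼梦']
lists_grade = ['9.4', '9.1', '9.6']

conn = sqlite3.connect('douban_books.db')   # Creates the file if it does not exist
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS books (title TEXT, grade REAL)')

# Insert title/rating pairs; ratings are stored as REAL so they sort numerically.
cur.executemany(
    'INSERT INTO books (title, grade) VALUES (?, ?)',
    [(b, float(g)) for b, g in zip(lists_book, lists_grade)]
)
conn.commit()

# Query back the highest-rated book to verify the data landed.
cur.execute('SELECT title, grade FROM books ORDER BY grade DESC LIMIT 1')
print(cur.fetchone())
conn.close()
```

Storing the rating as a REAL column also sidesteps the string-comparison pitfall: ORDER BY then compares 9.6 and 10.0 as numbers.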

Posted by NotVeryTechie on Wed, 02 Oct 2019 21:14:31 -0700