Python 3 crawler tutorial: requests + BeautifulSoup4 (bs4)

Keywords: Python Windows encoding DNS

Soon after learning Python crawlers, I couldn't wait to find a website to practice on, and settled on Xbiquge, a novel site.

Preparation

Install Python and the necessary modules (requests, bs4). If you don't know requests or bs4 yet, you can read the official documentation first and come back to this tutorial afterwards.
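
If the two modules are not installed yet, pip can install both in one go (note that the PyPI package name for bs4 is beautifulsoup4):

pip install requests beautifulsoup4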

Crawler logic

When I started writing crawlers as a complete beginner, one question bothered me: when does a crawl end? The answer: the crawler simulates a real reader, so when the page no longer has a link to the next chapter, the crawl is over.

1. Use a queue to store the links that still need to be crawled. Take one link out of the queue each time; if the queue is empty, the program ends (a minimal sketch of this loop follows the list)
2. requests sends the request, bs4 parses the response page, extracts the useful text, and puts the link to the next chapter into the queue
3. Append the extracted text to a txt file
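
Here is a minimal sketch of that queue-driven loop, using placeholder URLs, before we get to the real code:

import queue

q = queue.Queue()
q.put('https://example.com/chapter-1')  # seed the queue with the first link

while not q.empty():        # stop once no links remain (step 1)
    url = q.get()           # take the next link out of the queue
    print('crawling', url)  # a real crawler would fetch and parse here,
                            # q.put()-ing the next chapter's link (step 2)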

Specific code

You should add the IP address of the site you are crawling, together with its domain name, to your hosts file so that DNS resolution is skipped; otherwise the code stalls for a while on every request.
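
For example, on Windows 10 the hosts file is C:\Windows\System32\drivers\etc\hosts. The IP below is only a placeholder; look up the site's current address yourself (for instance with nslookup www.xbiquge6.com) before adding the entry:

# hosts entry (203.0.113.10 is a placeholder, substitute the real IP)
203.0.113.10    www.xbiquge6.com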

'''
Crawl a single novel from Xbiquge: https://www.xbiquge6.com/
Pipeline: requests -> bs4 -> txt
Python version: 3.7
OS: Windows 10
'''
import requests
import time
import sys
import queue
from bs4 import BeautifulSoup 
# A queue holds the URLs waiting to be crawled
q = queue.Queue()
# Fetch a page and return its decoded HTML text
def get_content(url):

    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }

        r = requests.get(url=url, headers=headers)
        r.encoding = 'utf-8'  # force UTF-8 so the Chinese text decodes correctly
        content = r.text
        return content
    except Exception:
        _, err_value, err_tb = sys.exc_info()
        print("Error '%s' happened on line %d" % (err_value, err_tb.tb_lineno))
        return " ERROR "

# Parse a chapter page
def parseContent(content):
    soup = BeautifulSoup(content, 'html.parser')
    # the chapter title sits in <div class="bookname"><h1>...</h1>
    chapter = soup.find(name='div', class_="bookname").h1.text
    # the chapter text sits in <div id="content">
    content = soup.find(name='div', id="content").text
    save(chapter, content)
    # the third link in <div class="bottem1"> is "next chapter"
    next1 = soup.find(name='div', class_="bottem1").find_all('a')[2].get('href')
    # If there is a next chapter, queue its link; on the last chapter the
    # "next" link points back to the book's index page ('/0_638/') instead
    if next1 != '/0_638/':
        q.put(base_url + next1)
    print(next1)
# Append a chapter to the txt file
def save(chapter, content):
    filename = "The God of God.txt"
    # 'with' guarantees the file is closed even if a write fails
    with open(filename, "a+", encoding='utf-8') as f:
        f.write(chapter + '\n')
        # collapse all whitespace in the chapter text before writing
        f.write("".join(content.split()) + '\n')

# main program
def main():
    start_time = time.time()
    q.put(first_url)
    # Keep looping until the queue runs out of links
    while not q.empty():
        content = get_content(q.get())
        parseContent(content)
    end_time = time.time()
    project_time = end_time - start_time
    print('Program time', project_time)

# Site root and the URL of the first chapter
base_url = 'https://www.xbiquge6.com'
first_url = 'https://www.xbiquge6.com/0_638/1124120.html'
if __name__ == '__main__':
    main()
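
One detail worth pointing out: "".join(content.split()) in save() collapses every run of whitespace, which strips the \xa0 indentation and the blank lines the site pads chapters with. Note that it removes all spaces, which is harmless for Chinese text but would glue English words together:

text = '\xa0\xa0\xa0\xa0First line.\n\n\xa0\xa0\xa0\xa0Second line.'
print("".join(text.split()))  # prints 'Firstline.Secondline.'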

Summary

It turned out to be a success. The run was slow, though: the whole program took about an hour and a half. Haha. I'll keep learning. If you have any ideas for improvement, please share them and discuss with me.
QQ:1156381157

Posted by Jnerocorp on Sat, 07 Dec 2019 21:50:40 -0800