Soon after learning Python web crawling, I couldn't wait to find a website to practice on. I picked Xinbiquge (https://www.xbiquge6.com), a web novel site.
Preparation
Install Python and the necessary modules (requests and bs4). If you are not familiar with requests or bs4, you can read their official documentation and tutorials first.
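If they are not installed yet, both can be pulled in with pip (note that bs4 is published on PyPI under the name beautifulsoup4):

pip install requests beautifulsoup4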
Crawler approach
When I first started writing crawlers as a beginner, one question bothered me: when should the crawler stop? The answer: a crawler simulates what a real reader does, so when the page no longer contains a link to the next chapter, the crawl is over.
1. Use a queue to store the links that still need to be crawled, and take one link out of the queue on each iteration. When the queue is empty, the program ends (see the sketch after this list).
2. requests sends the request, bs4 parses the response page and extracts the useful information, and the link to the next chapter goes into the queue.
3. Append the chapter text to a txt file with Python's built-in file I/O.
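Put together, the control flow is just a loop over the queue. Below is a minimal sketch of that pattern; fetch_page and parse_page are hypothetical stand-ins for the real functions in the full code further down.

import queue

# Hypothetical stand-ins for the real fetch/parse logic shown later.
def fetch_page(url):
    return "<html>...</html>"   # would be requests.get(url).text in practice

def parse_page(html):
    return None                 # would return the next-chapter URL, or None at the end

q = queue.Queue()
q.put('https://www.xbiquge6.com/0_638/1124120.html')   # seed with the first chapter
while not q.empty():                                   # empty queue == crawl finished
    html = fetch_page(q.get())
    next_url = parse_page(html)
    if next_url:
        q.put(next_url)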
The full code
One tip first: you can map the site's domain to its IP address in your hosts file (one line in the form `x.x.x.x www.xbiquge6.com`, with the real IP filled in) so that every request skips DNS resolution. Otherwise the program can stall for a while on each lookup.
'''
Crawl a single novel from Xinbiquge: https://www.xbiquge6.com/
Crawler pipeline: requests - bs4 - txt
Python version: 3.7
OS: Windows 10
'''
import requests
import time
import sys
import queue
from bs4 import BeautifulSoup

# A queue holds the URLs waiting to be crawled
q = queue.Queue()

# Site address and the URL of the first chapter
base_url = 'https://www.xbiquge6.com'
first_url = 'https://www.xbiquge6.com/0_638/1124120.html'


# Fetch a page and return its HTML text
def get_content(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }
        r = requests.get(url=url, headers=headers)
        r.encoding = 'utf-8'
        return r.text
    except Exception:
        s = sys.exc_info()
        print("Error '%s' happened on line %d" % (s[1], s[2].tb_lineno))
        return " ERROR "


# Parse a chapter page: extract the title and body, save them,
# and put the link to the next chapter into the queue
def parse_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    chapter = soup.find(name='div', class_="bookname").h1.text
    text = soup.find(name='div', id="content").text
    save(chapter, text)
    next1 = soup.find(name='div', class_="bottem1").find_all('a')[2].get('href')
    # On the last chapter the "next" link points back to the book's
    # index page, so only queue it if it is a real chapter link
    if next1 != '/0_638/':
        q.put(base_url + next1)
        print(next1)


# Append a chapter title and its text to the txt file
def save(chapter, content):
    filename = "The God of God.txt"
    with open(filename, "a+", encoding='utf-8') as f:
        f.write(chapter + '\n')
        f.write("".join(content.split()) + '\n')


# Main loop: keep crawling until the queue is empty
def main():
    start_time = time.time()
    q.put(first_url)
    while not q.empty():
        content = get_content(q.get())
        parse_content(content)
    end_time = time.time()
    print('Program time', end_time - start_time)


if __name__ == '__main__':
    main()
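One design note: for a single novel the queue never actually holds more than one URL at a time, since each chapter page yields exactly one "next chapter" link, so a plain variable would work just as well. The queue, however, generalizes nicely to crawls that discover several links per page. Also note the stop condition: on the final chapter, the "next" link points back to the book's index page /0_638/, which is why that value is excluded before queuing.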
Summary
In the end it was a success. The process was slow, though: the program ran for about an hour and a half. Haha. I will keep learning; if you have any ideas for improving it, please share them and let's discuss.
QQ:1156381157