How to use a Python crawler to go from one chapter of a novel to every novel on a site

Keywords: Python encoding Pycharm Windows

 

Preface

The text and pictures in this article come from the Internet and are for learning and discussion only, with no commercial use. Copyright belongs to the original authors. If you have any concerns, please contact us promptly so they can be addressed.


 

Many good novels can only be read online and cannot be downloaded. This article shows how to crawl every novel on a website.

Knowledge points:

  1. requests
  2. xpath
  3. The approach to crawling every novel on a site

Development environment:

  1. Version: Anaconda 5.2.0 (Python 3.6.5)
  2. Editor: PyCharm

Third-party libraries:

  1. requests
  2. parsel

Perform web page analysis

Target site:

 

  • Using the Network and Elements panels of the browser developer tools

 

Crawl a chapter of a novel

  • Using the requests library (requesting web page data)
  • Wrapping the request steps in a function
  • Using CSS selectors (parsing web page data)
  • Working with files (data persistence)
# -*- coding: utf-8 -*-
"""Crawl one chapter of a novel."""
import requests
import parsel
# Request the web page, sending a browser-like User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
response = requests.get('http://www.shuquge.com/txt/8659/2324752.html', headers=headers)
# Let requests guess the encoding from the page content
response.encoding = response.apparent_encoding
html = response.text
print(html)
# Extract the title and the body text from the page
sel = parsel.Selector(html)
title = sel.css('.content h1::text').extract_first()
contents = sel.css('#content::text').extract()
# Strip whitespace from both ends of every line
contents2 = []
for content in contents:
    contents2.append(content.strip())
print(contents)
print(contents2)
print("\n".join(contents2))
# Write the content to a text file named after the chapter title
with open(title + '.txt', mode='w', encoding='utf-8') as f:
    f.write("\n".join(contents2))
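One practical caveat when the chapter title becomes the file name: on Windows, characters such as `?`, `:` and `*` are not allowed in file names, and `extract_first()` returns `None` when the selector matches nothing. The helper below is a minimal sketch of one way to guard against both. `safe_filename` and its `fallback` parameter are hypothetical names, not part of the original code.

```python
import re

# Printable characters that Windows does not allow in file names
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, fallback='untitled'):
    """Return a .txt file name built from title that is safe to use on Windows."""
    if not title:
        # The selector matched nothing, so there is no title to use
        return fallback + '.txt'
    cleaned = _INVALID_CHARS.sub('_', title).strip()
    return (cleaned or fallback) + '.txt'


print(safe_filename('Chapter 1: What now?'))
print(safe_filename(None))
```

It could stand in for `title + '.txt'` in the `open()` call.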

Crawl a whole novel

  • Refactor the crawler. A whole novel has many chapters, and the most naive approach is to loop over the chapter URLs directly with a for loop.
  • Request the index page first. It links to every chapter, so once you have the URL of each chapter you can download all of them.
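The links on an index page are often relative (for example `2324752.html`), so each one has to be combined with the index URL before it can be requested. The following is a minimal sketch using `urllib.parse.urljoin` from the standard library. The list of links is written by hand here and stands in for the `href` values extracted from the index page.

```python
from urllib.parse import urljoin

book_url = 'http://www.shuquge.com/txt/8659/index.html'
# Relative links as they might appear in the href attributes of the index page
links = ['2324752.html', '2324753.html']

# urljoin replaces the last path segment (index.html) with the relative link
chapter_urls = [urljoin(book_url, link) for link in links]
for url in chapter_urls:
    print(url)
```

Because the base is the index URL itself, the same code works for another novel without editing a hard-coded prefix.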
import requests
import parsel
from urllib.parse import urljoin
# Send requests with a browser-like User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
def download_one_chapter(target_url):
    """Download one chapter and save it as '<title>.txt'."""
    # e.g. target_url = 'http://www.shuquge.com/txt/8659/2324753.html'
    # requests.get returns a Response object
    # (in PyCharm, Ctrl + left click jumps to its definition)
    response = requests.get(target_url, headers=headers)
    # Guess the encoding from the page content so the text decodes correctly
    response.encoding = response.apparent_encoding
    # response.text is the page source as a string
    html = response.text
    # Parse the string into a Selector object (parsel is the selector library used by Scrapy)
    sel = parsel.Selector(html)
    # ::text selects the text inside a tag; extract_first() returns the first match
    title = sel.css('.content h1::text').extract_first()
    # extract() returns every match as a list
    contents = sel.css('#content::text').extract()
    print(title)
    # Clean the data: strip whitespace at both ends of every line
    # (a list comprehension replaces the explicit loop used earlier)
    contents1 = [content.strip() for content in contents]
    # Join the list into one string
    text = '\n'.join(contents1)
    # Save the chapter; only strings can be written to a text file
    with open(title + '.txt', mode='w', encoding='utf-8') as file:
        file.write(title + '\n')
        file.write(text)
def get_book_links(book_url):
    """Return the chapter links found on a novel's index page."""
    response = requests.get(book_url, headers=headers)
    response.encoding = response.apparent_encoding
    sel = parsel.Selector(response.text)
    links = sel.css('dd a::attr(href)').extract()
    return links
def get_one_book(book_url):
    """Download every chapter of one novel."""
    links = get_book_links(book_url)
    for link in links:
        # Chapter links are relative to the index page
        chapter_url = urljoin(book_url, link)
        print(chapter_url)
        download_one_chapter(chapter_url)
if __name__ == '__main__':
    # To download a single chapter instead:
    # download_one_chapter(target_url='http://www.shuquge.com/txt/8659/2324754.html')
    # To download another novel, just change this URL
    book_url = 'http://www.shuquge.com/txt/8659/index.html'
    get_one_book(book_url)

Crawl every novel on the site
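The post gives no code for this step, so what follows is only a sketch of one possible approach, under stated assumptions. The idea is to add one more level to the loop: a listing page links to the index page of each novel, and each index page links to its chapters. `crawl_site`, `fetch_links` and `download_book` are hypothetical names. In practice `fetch_links` would request a listing page and extract `href` values with parsel, and `download_book` would be a function like `get_one_book`. Which listing pages exist on the target site, and which selector matches their book links, would need to be checked in the developer tools. The `example.com` URLs below are stand-ins so the sketch runs without touching the network.

```python
import time
from urllib.parse import urljoin


def crawl_site(listing_urls, fetch_links, download_book, delay=0.0):
    """Download every book linked from the given listing pages.

    fetch_links(url)        -> list of hrefs found on that listing page (hypothetical)
    download_book(book_url) -> downloads one novel, e.g. get_one_book
    """
    seen = set()
    for listing_url in listing_urls:
        for href in fetch_links(listing_url):
            # Resolve relative links against the page they were found on
            book_url = urljoin(listing_url, href)
            if book_url in seen:
                continue  # skip novels that appear on several listing pages
            seen.add(book_url)
            download_book(book_url)
            time.sleep(delay)  # optional pause between books
    return sorted(seen)


# Fake listing pages: each URL maps to the hrefs "found" on that page
fake_pages = {
    'http://example.com/list/1.html': ['/txt/1/index.html', '/txt/2/index.html'],
    'http://example.com/list/2.html': ['/txt/2/index.html', '/txt/3/index.html'],
}
downloaded = []  # records which books would have been downloaded
result = crawl_site(fake_pages, fake_pages.get, downloaded.append)
print(result)
```

Keeping the fetching and downloading steps as parameters makes the loop easy to try out before pointing it at a real site.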

If you want to learn more about applying Python, you can send the editor a private message.

Posted by rinventive on Sat, 28 Mar 2020 08:08:42 -0700