Preface
Many good novels can only be read online and not downloaded. This article walks through how to crawl every novel on a website.
Knowledge points:
- requests
- xpath
- The approach to crawling a whole site's novels
Development environment:
- Version: Anaconda 5.2.0 (Python 3.6.5)
- Editor: PyCharm
Third-party libraries:
- requests
- parsel
Web page analysis
Target site: http://www.shuquge.com
- Using the developer tools (the Network and Elements panels)
Crawl a chapter of a novel
- Using the requests library (requesting page data)
- Encapsulating the request steps in a function
- Using CSS selectors (parsing the page data)
- File operations (data persistence)
```python
# -*- coding: utf-8 -*-
import requests
import parsel

"""Crawl a single chapter of a novel"""

# Request the page data
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
response = requests.get('http://www.shuquge.com/txt/8659/2324752.html', headers=headers)
response.encoding = response.apparent_encoding
html = response.text
print(html)

# Extract the content from the page
sel = parsel.Selector(html)
title = sel.css('.content h1::text').extract_first()
contents = sel.css('#content::text').extract()
contents2 = []
for content in contents:
    # Strip whitespace from both ends of each line
    contents2.append(content.strip())
print(contents)
print(contents2)
print("\n".join(contents2))

# Write the content to a text file
with open(title + '.txt', mode='w', encoding='utf-8') as f:
    f.write("\n".join(contents2))
```
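One caveat before running this: the scraped chapter title becomes the file name, and titles can contain characters that are illegal in Windows file names (`\ / : * ? " < > |`). A minimal sanitizing helper; the name `safe_filename` is my own, not from the original code:

```python
import re

def safe_filename(title):
    # Replace characters that Windows does not allow in file names
    return re.sub(r'[\\/:*?"<>|]', '_', title)

# Then write the file as:
# with open(safe_filename(title) + '.txt', mode='w', encoding='utf-8') as f:
#     f.write("\n".join(contents2))
```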
Crawling a whole novel
- Refactor the crawler to fetch many chapters; the simplest approach is a plain for loop.
- To reach every chapter, request the novel's index page first and collect each chapter's URL from it.
```python
import requests
import parsel

"""Get the page source code"""
# Send the request pretending to be a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}


def download_one_chapter(target_url):
    # URL to request, e.g. 'http://www.shuquge.com/txt/8659/2324753.html'
    # response is the content object returned by the server
    # (in PyCharm, Ctrl + left click jumps to a definition)
    response = requests.get(target_url, headers=headers)
    # Universal decoding
    response.encoding = response.apparent_encoding
    # The .text attribute gives the page content as a string
    html = response.text

    """Extract information from the page source"""
    # parsel turns the string into a selector object (the same selectors scrapy uses)
    sel = parsel.Selector(html)
    # extract() pulls the contents of a tag; CSS selectors pick the tag,
    # and the ::text pseudo-class selects its text
    # Extract the first match
    title = sel.css('.content h1::text').extract_first()
    # Extract all matches
    contents = sel.css('#content::text').extract()
    print(title)
    print(contents)

    """Data cleaning: remove surrounding whitespace and empty strings"""
    # List comprehension: strip whitespace from both ends of each string
    contents1 = [content.strip() for content in contents]
    print(contents1)
    # Join the list into a single string
    text = '\n'.join(contents1)
    print(text)

    """Save the chapter content"""
    # open the file for writing
    file = open(title + '.txt', mode='w', encoding='utf-8')
    # Only strings can be written; add a newline after the title
    file.write(title + '\n')
    file.write(text)
    # Close the file
    file.close()


# Get the chapter links from a novel's table of contents
def get_book_links(book_url):
    response = requests.get(book_url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    sel = parsel.Selector(html)
    links = sel.css('dd a::attr(href)').extract()
    return links


# Download one whole novel
def get_one_book(book_url):
    links = get_book_links(book_url)
    for link in links:
        print('http://www.shuquge.com/txt/8659/' + link)
        download_one_chapter('http://www.shuquge.com/txt/8659/' + link)


if __name__ == '__main__':
    # Download a single chapter:
    # download_one_chapter(target_url='http://www.shuquge.com/txt/8659/2324754.html')
    # To download a different novel, just change the url
    book_url = 'http://www.shuquge.com/txt/8659/index.html'
    get_one_book(book_url)
```
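When a novel has hundreds of chapters, firing requests back to back can get the crawler blocked. A simple courtesy tweak is to pause inside the download loop; the one-second delay below is my own choice, not part of the original code:

```python
import time

def get_one_book(book_url):
    links = get_book_links(book_url)
    for link in links:
        chapter_url = 'http://www.shuquge.com/txt/8659/' + link
        print(chapter_url)
        download_one_chapter(chapter_url)
        # Assumed courtesy delay, not in the original: one second between chapters
        time.sleep(1)
```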
Crawling every novel on the site
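The article stops short of code for this step, but the idea is the same pattern one level up: every novel has an index page, so collect the index-page URLs from the site's listing pages and feed each one to the book downloader. A sketch, reusing `get_book_links` and `download_one_chapter` from above; the CSS selector and the shape of the returned URLs are illustrative assumptions, not verified against the site:

```python
def get_one_book_generic(book_url):
    # Like get_one_book, but derives the chapter-URL prefix from the book's
    # own index URL instead of hardcoding the book id 8659
    base = book_url.rsplit('/', 1)[0] + '/'
    for link in get_book_links(book_url):
        download_one_chapter(base + link)


def get_whole_site(site_url):
    # Request a listing page and pull out a link to every book's index page
    response = requests.get(site_url, headers=headers)
    response.encoding = response.apparent_encoding
    sel = parsel.Selector(response.text)
    # ASSUMPTION: this selector is illustrative; inspect the real listing page
    # with the developer tools (Elements panel) to find the right one
    for book_url in sel.css('.listcon a::attr(href)').extract():
        # book_url is assumed to be a full index URL like
        # http://www.shuquge.com/txt/8659/index.html
        get_one_book_generic(book_url)
```

Note that the original `get_one_book` hardcodes the book id in the chapter URL, which is why the generalized variant above rebuilds the prefix from each book's own index URL.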