- pip install scrapy
- Possible problems:
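Problem: error: Microsoft Visual C++ 14.0 is required. Resolution: this error appears when pip has to compile a dependency from source and no compiler is available; installing the Microsoft C++ Build Tools usually resolves it.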
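Before moving on, it can help to confirm that the install succeeded. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of Scrapy.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "scrapy" is the package installed above; "win32api" is the module
    # named in the error message mentioned in the last step (Windows only).
    for name in ("scrapy", "win32api"):
        status = "found" if is_installed(name) else "missing"
        print(name, status)
```

Running it after `pip install scrapy` should report `scrapy found`; a `missing` result means the install went into a different Python environment than the one running the script.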
Demo instance, following the tutorial (Chinese tutorial document)
Step 1: create a project directory
scrapy startproject tutorial
Step 2: enter the tutorial directory and create a spider
scrapy genspider baidu www.baidu.com
Step 3: create a storage container: copy items.py in the project and rename the copy to BaiduItems.py
    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class BaiduItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
Step 4: modify spiders/baidu.py to extract data with XPath
    # -*- coding: utf-8 -*-
    import scrapy

    # Import the data container (the class defined in BaiduItems.py is BaiduItem)
    from tutorial.BaiduItems import BaiduItem


    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.readingbar.net']
        start_urls = ['http://www.readingbar.net/']

        def parse(self, response):
            # Each <li> inside a <ul> becomes one item
            for sel in response.xpath('//ul/li'):
                item = BaiduItem()
                # extract() returns a list of strings for every field
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item
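To see what the three XPath expressions in the spider select, here is a rough sketch that mimics them on a tiny page fragment. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, so it only approximates their behaviour; the sample markup and the function name `extract_items` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, made up for this example.
SAMPLE = """
<html><body>
  <ul>
    <li><a href="/books/1">First book</a> a short description</li>
    <li><a href="/books/2">Second book</a> another description</li>
  </ul>
</body></html>
"""


def extract_items(markup):
    """Return one dict per <li>, with list-valued fields like extract() gives."""
    root = ET.fromstring(markup.strip())
    items = []
    for li in root.findall('.//ul/li'):              # roughly //ul/li
        links = li.findall('a')
        # Text nodes that are direct children of <li>: the text before the
        # first child plus the text following each <a>.
        text_nodes = [li.text] + [a.tail for a in links]
        items.append({
            'title': [a.text for a in links],        # roughly a/text()
            'link': [a.get('href') for a in links],  # roughly a/@href
            'desc': [t for t in text_nodes if t],    # roughly text()
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE):
        print(item)
```

Every field is a list because `extract()` returns all matches, not just the first one; that is why the exported JSON contains arrays even when a field has a single value.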
Step 5: fix the blank result returned when crawling the Baidu homepage by editing settings.py
    # Set the user agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'

    # Work around robots.txt-related failures by not obeying robots.txt
    ROBOTSTXT_OBEY = False

    # Keep Scrapy from writing garbled (escaped) characters to the exported data
    FEED_EXPORT_ENCODING = 'utf-8'
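The "garbled" output that FEED_EXPORT_ENCODING addresses is the \uXXXX escaping that JSON export uses for non-ASCII characters by default. The sketch below reproduces the same effect with the standard `json` module rather than Scrapy's own exporter, so treat it as an analogy; the sample record is invented.

```python
import json

# Hypothetical scraped record containing Chinese text.
record = {"title": ["百度"]}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes,
# which is what the exported file looks like without the setting.
escaped = json.dumps(record)

# With escaping turned off the characters are written as-is, which is
# comparable to exporting with FEED_EXPORT_ENCODING = 'utf-8'.
readable = json.dumps(record, ensure_ascii=False)

if __name__ == "__main__":
    print(escaped)
    print(readable)
```

Both strings decode to the same data; the setting only changes how readable the file is when opened in an editor.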
Last step: run the crawl command and save the data to the specified file
scrapy crawl baidu -o baidu.json
An error may be reported during execution: No module named 'win32api'. The win32api module is provided by the pywin32 package; download and install the version that matches your Python installation.
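Once the crawl finishes, the exported file can be inspected from Python. This is a small sketch that assumes baidu.json holds a JSON array of items with list-valued fields, as produced by the spider above; the helper names `load_items` and `first_values` are hypothetical.

```python
import json
import os


def load_items(path):
    """Load the list of scraped items from a JSON feed file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def first_values(item):
    """Collapse each list-valued field to its first entry (None if empty)."""
    return {key: (values[0] if values else None) for key, values in item.items()}


if __name__ == "__main__":
    path = "baidu.json"  # the file named in the crawl command
    if os.path.exists(path):
        for item in load_items(path):
            print(first_values(item))
    else:
        print(path, "not found; run the crawl first")
```

Collapsing each field to its first value is just a convenience for reading the results; the file itself keeps the full lists.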
[Python crawler learning I] From installing Scrapy on Python 3.7 to a demo instance: crawling the Baidu homepage
Posted by texelate on Mon, 02 Dec 2019 19:13:23 -0800