[python crawler learning I] Install Python 3.7 and Scrapy, with a demo: crawling the Baidu homepage

Keywords: Python pip Windows JSON

  1. pip install scrapy
  2. Possible problems:
    Problem: error: Microsoft Visual C++ 14.0 is required.
    Resolution: install the Microsoft C++ Build Tools (or a prebuilt Twisted wheel), then rerun pip install scrapy.
  3. Demo walkthrough (adapted from the Scrapy tutorial documentation)
    Step 1: create a project directory
    Step 1: create a project directory

    scrapy startproject tutorial
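    The command above generates a project skeleton roughly like the following (the exact set of files varies slightly between Scrapy versions):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited in Step 3)
        middlewares.py
        pipelines.py
        settings.py       # project settings (edited in Step 5)
        spiders/
            __init__.py   # spiders go here (created in Step 2)
```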

    Step 2: enter tutorial to create spider crawler

    scrapy genspider baidu www.baidu.com

    Step 3: create a storage container: in items.py, define the item class BaiduItem

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class BaiduItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
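    A scrapy.Item instance behaves like a dict, and because extract() returns every matching node, each field ends up list-valued. A minimal stand-in using a plain dict (hypothetical values, no Scrapy required) to show the resulting record shape:

```python
import json

# Plain-dict stand-in for a BaiduItem; the values are lists because
# Scrapy's extract() returns a list of all matching nodes.
record = {
    "title": ["Example page"],
    "link": ["http://example.com/"],
    "desc": ["a short description"],
}

# ensure_ascii=False mirrors the FEED_EXPORT_ENCODING = 'utf-8' setting in Step 5
print(json.dumps(record, ensure_ascii=False))
```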

    Step 4: modify spiders/baidu.py to extract data with XPath

    # -*- coding: utf-8 -*-
    import scrapy
    # Import the data container defined in items.py
    from tutorial.items import BaiduItem
    
    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.readingbar.net']
        start_urls = ['http://www.readingbar.net/']
    
        def parse(self, response):
            # Treat each <li> under a <ul> as one scraped record
            for sel in response.xpath('//ul/li'):
                item = BaiduItem()
                item['title'] = sel.xpath('a/text()').extract()  # link text
                item['link'] = sel.xpath('a/@href').extract()    # href attribute
                item['desc'] = sel.xpath('text()').extract()     # bare text nodes
                yield item
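    The XPath expressions above can be exercised without running a crawl. A rough standard-library approximation on a made-up snippet (ElementTree supports only a small XPath subset, so a/text(), a/@href, and text() are emulated with .text, .get(), and .tail):

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for a crawled page
html = """
<ul>
  <li><a href="http://example.com/1">First link</a> first description</li>
  <li><a href="http://example.com/2">Second link</a> second description</li>
</ul>
"""

root = ET.fromstring(html)
items = []
for li in root.findall('li'):
    a = li.find('a')
    items.append({
        'title': a.text,                 # ~ sel.xpath('a/text()')
        'link': a.get('href'),           # ~ sel.xpath('a/@href')
        'desc': (a.tail or '').strip(),  # ~ sel.xpath('text()')
    })
print(items)
```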

    Step 5: fix the blank page returned when scraping the Baidu homepage by editing settings.py

    # Set a browser User-Agent (Baidu serves a blank page to the default Scrapy agent)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    
    # Ignore robots.txt, which would otherwise block the crawl
    ROBOTSTXT_OBEY = False
    # Avoid garbled characters in the exported data
    FEED_EXPORT_ENCODING = 'utf-8'

    Last step: run the crawl command and save the data to the specified file.
    An error may be reported during execution: No module named 'win32api'. Install pywin32 (pip install pywin32) to fix it.

    scrapy crawl baidu -o baidu.json
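    When the command finishes, baidu.json holds a JSON array of the scraped items. A small sketch of reading the export back (the file contents here are fabricated for illustration; real contents depend on the crawled page):

```python
import json

# Fabricated stand-in for what `scrapy crawl baidu -o baidu.json` might write
sample = [{"title": ["Example"], "link": ["http://example.com/"], "desc": ["demo"]}]
with open("baidu.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Read the export back; utf-8 matches FEED_EXPORT_ENCODING from Step 5
with open("baidu.json", encoding="utf-8") as f:
    records = json.load(f)
print(len(records), records[0]["title"][0])
```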

Posted by texelate on Mon, 02 Dec 2019 19:13:23 -0800