[python crawler learning I] Install Python 3.7 and Scrapy, with a demo: crawling the Baidu homepage

Keywords: Python pip Windows JSON

  1. pip install scrapy
  2. Possible problems:
    Problem: error: Microsoft Visual C++ 14.0 is required.
    Resolution: install the Microsoft C++ Build Tools (or a prebuilt Twisted wheel), then rerun pip install scrapy.
  3. Demo walkthrough (adapted from the Scrapy tutorial documentation)
    Step 1: create a project directory
    Step 1: create a project directory

    scrapy startproject tutorial
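    The command above generates a project skeleton roughly like the following (the exact set of files varies slightly between Scrapy versions):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited in Step 3)
        middlewares.py
        pipelines.py
        settings.py       # project settings (edited in Step 5)
        spiders/
            __init__.py   # spiders go here (created in Step 2)
```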

    Step 2: enter tutorial to create spider crawler

    scrapy genspider baidu www.baidu.com

    Step 3: create a storage container: in items.py, define the item class BaiduItem

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class BaiduItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
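    A scrapy.Item instance behaves like a dict, and because extract() returns every matching node, each field ends up list-valued. A minimal stand-in using a plain dict (hypothetical values, no Scrapy required) to show the resulting record shape:

```python
import json

# Plain-dict stand-in for a BaiduItem; the values are lists because
# Scrapy's extract() returns a list of all matching nodes.
record = {
    "title": ["Example page"],
    "link": ["http://example.com/"],
    "desc": ["a short description"],
}

# ensure_ascii=False mirrors the FEED_EXPORT_ENCODING = 'utf-8' setting in Step 5
print(json.dumps(record, ensure_ascii=False))
```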

    Step 4: modify spiders/baidu.py to extract data with XPath

    # -*- coding: utf-8 -*-
    import scrapy
    # Import the data container defined in items.py
    from tutorial.items import BaiduItem
    
    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.readingbar.net']
        start_urls = ['http://www.readingbar.net/']
    
        def parse(self, response):
            # Treat each <li> under a <ul> as one scraped record
            for sel in response.xpath('//ul/li'):
                item = BaiduItem()
                item['title'] = sel.xpath('a/text()').extract()  # link text
                item['link'] = sel.xpath('a/@href').extract()    # href attribute
                item['desc'] = sel.xpath('text()').extract()     # bare text nodes
                yield item
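    The XPath expressions above can be exercised without running a crawl. A rough standard-library approximation on a made-up snippet (ElementTree supports only a small XPath subset, so a/text(), a/@href, and text() are emulated with .text, .get(), and .tail):

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for a crawled page
html = """
<ul>
  <li><a href="http://example.com/1">First link</a> first description</li>
  <li><a href="http://example.com/2">Second link</a> second description</li>
</ul>
"""

root = ET.fromstring(html)
items = []
for li in root.findall('li'):
    a = li.find('a')
    items.append({
        'title': a.text,                 # ~ sel.xpath('a/text()')
        'link': a.get('href'),           # ~ sel.xpath('a/@href')
        'desc': (a.tail or '').strip(),  # ~ sel.xpath('text()')
    })
print(items)
```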

    Step 5: fix the blank page returned when scraping the Baidu homepage by editing settings.py

    # Set a browser User-Agent (Baidu serves a blank page to the default Scrapy agent)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    
    # Ignore robots.txt, which would otherwise block the crawl
    ROBOTSTXT_OBEY = False
    # Avoid garbled characters in the exported data
    FEED_EXPORT_ENCODING = 'utf-8'

    Last step: run the crawl command and save the data to the specified file.
    An error may be reported during execution: No module named 'win32api'. Install pywin32 (pip install pywin32) to fix it.

    scrapy crawl baidu -o baidu.json
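    When the command finishes, baidu.json holds a JSON array of the scraped items. A small sketch of reading the export back (the file contents here are fabricated for illustration; real contents depend on the crawled page):

```python
import json

# Fabricated stand-in for what `scrapy crawl baidu -o baidu.json` might write
sample = [{"title": ["Example"], "link": ["http://example.com/"], "desc": ["demo"]}]
with open("baidu.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Read the export back; utf-8 matches FEED_EXPORT_ENCODING from Step 5
with open("baidu.json", encoding="utf-8") as f:
    records = json.load(f)
print(len(records), records[0]["title"][0])
```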

Posted by texelate on Mon, 02 Dec 2019 19:13:23 -0800