Simple use of the Scrapy framework

Keywords: Python Windows Linux shell

I. Installing dependencies

#Windows platform
    1. pip3 install wheel
    2. pip3 install lxml
    3. pip3 install pyopenssl
    4. pip3 install pywin32  #If this fails, download it from the official site: https://sourceforge.net/projects/pywin32/files/pywin32/
    5. pip3 install twisted  #If this fails, download the wheel from: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    6. pip3 install scrapy
  
#Linux platform
    1,pip3 install scrapy
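
A quick way to confirm that the installation succeeded is to import Scrapy and print its version (a minimal check; the exact version reported depends on what pip installed):

    # verify_install.py -- confirm that scrapy can be imported
    import scrapy

    print(scrapy.__version__)  # prints the installed Scrapy version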

II. Commands

#1 view help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands: Project-only commands must be run from inside the project folder, while Global commands do not need to be
    Global commands:
        startproject #Create a project
        genspider    #Usually run after cd-ing into the project directory: scrapy genspider <name> <url>
        settings     #Show settings; if run inside the project directory, it shows that project's configuration
        runspider    #Run a standalone python file as a spider without creating a project (see the sketch after the link below)
        shell        #Interactive shell, e.g. for checking whether a selector rule is correct
        fetch        #Fetch a single page independently of the project; you can also get the request headers
        view         #Download the page and open it directly in the browser, to tell which data comes from ajax requests
        version      #scrapy version shows the scrapy version; scrapy version -v also shows the versions of its dependency libraries
    Project-only commands:
        crawl        #Run a spider; a project must have been created. Make sure ROBOTSTXT_OBEY = False in the configuration file
        check        #Check for syntax errors in the project
        list         #List the spider names included in the project
        edit         #Editor, generally not used
        parse        #scrapy parse <url> --callback <callback function>  #Use this to verify whether our callback function is correct
        bench        #scrapy bench, a stress/benchmark test

#3 official website link
    https://docs.scrapy.org/en/latest/topics/commands.html
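
To illustrate the runspider command above, here is a minimal standalone spider that can be run without creating a project (a sketch; the file name standalone_spider.py and the quotes.toscrape.com demo site are just example choices):

    # standalone_spider.py -- run with: scrapy runspider standalone_spider.py -o quotes.json
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"                                # spider name used by scrapy crawl / scrapy list
        start_urls = ["http://quotes.toscrape.com/"]   # public demo site often used for scraping examples

        def parse(self, response):
            # extract each quote's text and author with CSS selectors
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }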

#4 Run a crawler without printing the log:
    scrapy crawl <spider name> --nolog

III. File descriptions

  • scrapy.cfg the project's main configuration information, used when deploying the project
  • items.py sets data storage templates for structured data, similar to Django's Model (see the sketch after this list)
  • pipelines.py data processing behaviors, e.g. persisting the structured data
  • settings.py configuration file, e.g. recursion depth, concurrency, download delay, etc. Note: the options in the configuration file must be uppercase, otherwise they are considered invalid; the correct form is USER_AGENT = 'xxxx'
  • spiders the crawler directory, e.g. create spider files and write the crawling rules there
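
As a sketch of how items.py and a pipeline fit together (the ProductItem fields, the JsonWriterPipeline name, and the myproject package path are hypothetical examples, not generated by Scrapy):

    # items.py -- structured data template, similar to Django's Model
    import scrapy

    class ProductItem(scrapy.Item):
        title = scrapy.Field()   # hypothetical field
        price = scrapy.Field()   # hypothetical field

    # pipelines.py -- data processing behavior, e.g. persisting structured data
    import json

    class JsonWriterPipeline:
        def open_spider(self, spider):
            self.file = open("items.jl", "w")   # one JSON object per line

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

    # settings.py -- options must be uppercase; enable the pipeline like this:
    # ITEM_PIPELINES = {"myproject.pipelines.JsonWriterPipeline": 300}
    # USER_AGENT = 'Mozilla/5.0 ...'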
