Simple use of the Scrapy framework

Keywords: Python Windows Linux shell

I. Installing dependencies

#Windows platform
    1. pip3 install wheel
    2. pip3 install lxml
    3. pip3 install pyopenssl
    4. pip3 install pywin32  #If this fails, download it from the official site: https://sourceforge.net/projects/pywin32/files/pywin32/
    5. pip3 install twisted  #If this fails, download the wheel from: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    6. pip3 install scrapy
  
#Linux platform
    1,pip3 install scrapy
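
A quick way to confirm that the installation succeeded is to import Scrapy and print its version (a minimal check; the exact version reported depends on what pip installed):

    # verify_install.py -- confirm that scrapy can be imported
    import scrapy

    print(scrapy.__version__)  # prints the installed Scrapy version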

II. Commands

#1 view help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands: Project-only commands must be run from inside the project folder, while Global commands do not need to be
    Global commands:
        startproject #Create a project
        genspider    #Usually run after cd-ing into the project directory: scrapy genspider <name> <url>
        settings     #Show settings; if run inside the project directory, it shows that project's configuration
        runspider    #Run a standalone python file as a spider without creating a project (see the sketch after the link below)
        shell        #Interactive shell, e.g. for checking whether a selector rule is correct
        fetch        #Fetch a single page independently of the project; you can also get the request headers
        view         #Download the page and open it directly in the browser, to tell which data comes from ajax requests
        version      #scrapy version shows the scrapy version; scrapy version -v also shows the versions of its dependency libraries
    Project-only commands:
        crawl        #Run a spider; a project must have been created. Make sure ROBOTSTXT_OBEY = False in the configuration file
        check        #Check for syntax errors in the project
        list         #List the spider names included in the project
        edit         #Editor, generally not used
        parse        #scrapy parse <url> --callback <callback function>  #Use this to verify whether our callback function is correct
        bench        #scrapy bench, a stress/benchmark test

#3 official website link
    https://docs.scrapy.org/en/latest/topics/commands.html
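
To illustrate the runspider command above, here is a minimal standalone spider that can be run without creating a project (a sketch; the file name standalone_spider.py and the quotes.toscrape.com demo site are just example choices):

    # standalone_spider.py -- run with: scrapy runspider standalone_spider.py -o quotes.json
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"                                # spider name used by scrapy crawl / scrapy list
        start_urls = ["http://quotes.toscrape.com/"]   # public demo site often used for scraping examples

        def parse(self, response):
            # extract each quote's text and author with CSS selectors
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }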

#4 Run a crawler without printing the log:
    scrapy crawl <spider name> --nolog

III. File descriptions

  • scrapy.cfg the project's main configuration information, used when deploying the project
  • items.py sets data storage templates for structured data, similar to Django's Model (see the sketch after this list)
  • pipelines.py data processing behaviors, e.g. persisting the structured data
  • settings.py configuration file, e.g. recursion depth, concurrency, download delay, etc. Note: the options in the configuration file must be uppercase, otherwise they are considered invalid; the correct form is USER_AGENT = 'xxxx'
  • spiders the crawler directory, e.g. create spider files and write the crawling rules there
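
As a sketch of how items.py and a pipeline fit together (the ProductItem fields, the JsonWriterPipeline name, and the myproject package path are hypothetical examples, not generated by Scrapy):

    # items.py -- structured data template, similar to Django's Model
    import scrapy

    class ProductItem(scrapy.Item):
        title = scrapy.Field()   # hypothetical field
        price = scrapy.Field()   # hypothetical field

    # pipelines.py -- data processing behavior, e.g. persisting structured data
    import json

    class JsonWriterPipeline:
        def open_spider(self, spider):
            self.file = open("items.jl", "w")   # one JSON object per line

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

    # settings.py -- options must be uppercase; enable the pipeline like this:
    # ITEM_PIPELINES = {"myproject.pipelines.JsonWriterPipeline": 300}
    # USER_AGENT = 'Mozilla/5.0 ...'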
