I. Installing dependencies
```
#Windows platform
1. pip3 install wheel
2. pip3 install lxml
3. pip3 install pyopenssl
4. pip3 install pywin32   #if pip fails, download it from the official site: https://sourceforge.net/projects/pywin32/files/pywin32/
5. pip3 install twisted   #if pip fails, download a wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
6. pip3 install scrapy

#Linux platform
1. pip3 install scrapy
```
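If the installs succeed, a quick sanity check is to ask Scrapy for its version (the same command is covered again in the command list below):

```
scrapy version -v   #prints scrapy's version and the versions of its dependency libraries
```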
II. Commands
```
#1 View help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands: Project-only commands must be run from inside a project folder, while Global commands can be run anywhere.

    Global commands:
        startproject  #create a project
        genspider     #create a spider; usually cd into the project directory first, then run: scrapy genspider <name> <url>
        settings      #if run inside a project directory, prints that project's configuration
        runspider     #run a standalone python file without having to create a project
        shell         #interactive shell, e.g. for checking whether a selector rule is correct
        fetch         #fetch a single page independently of any project; lets you inspect the request headers
        view          #download a page and pop it open in a browser, to tell which data comes from ajax requests
        version       #scrapy version prints scrapy's version; scrapy version -v also prints the versions of its dependency libraries
    Project-only commands:
        crawl         #run a spider; requires a project, and ROBOTSTXT_OBEY = False in the configuration file
        check         #check the project for syntax errors
        list          #list the spider names contained in the project
        edit          #open a spider in an editor; rarely used
        parse         #scrapy parse <url> --callback <callback function>  #verifies that the callback function is correct
        bench         #scrapy bench, a stress test

#3 Official documentation:
    https://docs.scrapy.org/en/latest/topics/commands.html
```
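Putting the two groups together, a typical first session might look like this (the project name `myproject` and spider name `example` are placeholders, not from the original text):

```
scrapy startproject myproject         #Global: create the project skeleton
cd myproject
scrapy genspider example example.com  #Global: generate spiders/example.py for the given name and domain

scrapy list                           #Project-only: prints "example"
scrapy check                          #Project-only: look for syntax errors
scrapy crawl example                  #Project-only: run the spider (with ROBOTSTXT_OBEY = False in settings.py)
```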
To run a crawler without printing the log: `scrapy crawl <spider name> --nolog`
III. File descriptions
- scrapy.cfg: the project's main configuration, used when deploying the project; the crawler-related settings live in settings.py
- items.py: sets the data-storage template for structured data, similar to Django's Model (a minimal sketch follows this list)
- pipelines.py: data-processing behavior, e.g. persisting the structured data
- settings.py: the configuration file, e.g. recursion depth, concurrency, download delay, etc. Emphasis: option names in this file must be UPPERCASE, otherwise they are considered invalid; the correct form is USER_AGENT = 'xxxx'
- spiders: the crawler directory; create files here and write the crawler rules in them
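A minimal sketch of how these files fit together; every concrete name below (ArticleItem, ExampleSpider, PrintPipeline, example.com) is an illustrative assumption, not from the original text:

```python
import scrapy

# items.py -- the data-storage template, similar to a Django Model
class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

# spiders/example.py -- crawler rules live in the spiders directory
class ExampleSpider(scrapy.Spider):
    name = "example"                      # the name used by `scrapy crawl example`
    start_urls = ["https://example.com"]  # illustrative start page

    def parse(self, response):
        # yield one item per link on the page
        for link in response.css("a"):
            yield ArticleItem(
                title=link.css("::text").get(),
                url=link.attrib.get("href"),
            )

# pipelines.py -- data-processing behavior for every item the spider yields
class PrintPipeline:
    def process_item(self, item, spider):
        print(dict(item))  # a real pipeline would persist the item instead
        return item

# settings.py -- remember: option names must be UPPERCASE to take effect
# ROBOTSTXT_OBEY = False
# USER_AGENT = 'xxxx'
# ITEM_PIPELINES = {"myproject.pipelines.PrintPipeline": 300}
```

With those pieces in place, `scrapy crawl example` would print each scraped link as a dict.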