7.10 Scrapy's Anti-Crawler Countermeasures
Setting up downloader middleware (which intercepts requests and responses) takes two steps:
1. Write the downloader middleware.
2. Activate it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
}
7.10.1 Writing Middleware
process_request(request, spider)

This method is called for each request that passes through the downloader middleware. process_request() must return one of the following: None, a Response object, a Request object, or it must raise IgnoreRequest.

If it returns None, Scrapy continues processing the request, executing the corresponding methods of the other middleware, until the appropriate download handler is called and the request is executed (its response is downloaded).

If it returns a Response object, Scrapy will not call any other process_request() or process_exception() method, nor the download function; it returns that response directly. The process_response() methods of the installed middleware are still invoked for every response.

If it returns a Request object, Scrapy stops calling the remaining process_request() methods and reschedules the returned request. Once the newly returned request has been downloaded, the appropriate middleware chain is invoked on its response.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middleware are invoked. If no method handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:
request (Request object) - the request being processed
spider (Spider object) - the spider this request belongs to
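A minimal sketch of these rules (the middleware class and the blocked-domain check are illustrative assumptions, not part of the original):

from scrapy.exceptions import IgnoreRequest

class SketchDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Hypothetical check: refuse to download from a blocked host
        if 'blocked.example.com' in request.url:
            raise IgnoreRequest('blocked host')
        # Returning None passes the request on to the next middleware
        # and, eventually, to the downloader
        return None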
process_response(request, response, spider)

process_response() must return one of the following: a Response object, a Request object, or it must raise an IgnoreRequest exception.

If it returns a Response (which can be the same as the incoming response or a completely new object), that response is processed by the process_response() methods of the other middleware in the chain.

If it returns a Request object, the middleware chain stops and the returned request is rescheduled for download. This is handled the same way as a request returned by process_request().

If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:
request (Request object) - the request corresponding to the response
response (Response object) - the response being processed
spider (Spider object) - the spider this response belongs to

process_exception(request, exception, spider)

Scrapy calls process_exception() when a download handler or a downloader middleware's process_request() raises an exception (including an IgnoreRequest exception). process_exception() should return one of the following: None, a Response object, or a Request object.

If it returns None, Scrapy continues handling the exception, calling the process_exception() methods of the other installed middleware in turn, until all middleware have been invoked and the default exception handling takes over.

If it returns a Response object, the process_response() methods of the installed middleware chain are called; Scrapy will not call the process_exception() method of any other middleware.

If it returns a Request object, the returned request is rescheduled for download. This stops the execution of the remaining process_exception() methods, just as returning a response does.

Parameters:
request (Request object) - the request that raised the exception
exception (Exception object) - the raised exception
spider (Spider object) - the spider this request belongs to
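A hedged sketch of how these return values combine in practice (the 503-retry logic and the class name are illustrative assumptions, not from the original):

class RetryOn503Middleware(object):
    def process_response(self, request, response, spider):
        # Hypothetical: reschedule a request once if the server answered 503
        if response.status == 503 and not request.meta.get('retried'):
            retry = request.replace(dont_filter=True)  # bypass the dupe filter
            retry.meta['retried'] = True
            return retry  # stops the chain; the request is rescheduled
        return response  # hand the response to the next middleware

    def process_exception(self, request, exception, spider):
        # Returning None lets the remaining middleware (and the default
        # handling) deal with the exception
        spider.logger.warning('download failed: %r', exception)
        return None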
7.10.2 A Random User-Agent Middleware
# middlewares.py
import random

from mySpider.settings import USER_AGENTS


class RandomUserAgent(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent from the pool for each request
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault('User-Agent', useragent)
        # Return the appropriate value as required; falling through
        # (returning None) lets Scrapy continue processing the request
        # return None

# settings.py
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
]
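To confirm the middleware is actually applied, one option is to log the header from a spider callback (a hypothetical callback, not part of the original):

def parse(self, response):
    # The request that produced this response carries the header
    # set by RandomUserAgent
    self.logger.info(response.request.headers.get('User-Agent'))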
7.10.3 A Proxy Middleware
# middlewares.py
# Proxy middleware
import base64
import random

from mySpider.settings import PROXIES


class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy.get('user_password') is None:
            # Proxy without authentication
            request.meta['proxy'] = 'http://' + proxy['ip_port']
        else:
            # b64encode() requires bytes in Python 3, and the scheme
            # needs a trailing space: 'Basic <credentials>'
            bs64_userpassword = base64.b64encode(proxy['user_password'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + bs64_userpassword
            request.meta['proxy'] = 'http://' + proxy['ip_port']

# settings.py
# Proxy settings: without authentication
PROXIES = [
    {'ip_port': '127.0.0.3:8000'},
    {'ip_port': '127.0.0.3:8000'},
    {'ip_port': '127.0.0.3:8000'},
]
# Or with authentication ('user:password')
PROXIES = [
    {'ip_port': '127.0.0.3:8000', 'user_password': 'user1:1234567'},
    {'ip_port': '127.0.0.3:8000', 'user_password': 'user1:1234567'},
    {'ip_port': '127.0.0.3:8000', 'user_password': 'user1:1234567'},
]
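Like any downloader middleware, both classes only take effect once they are activated in settings; a sketch of the corresponding entry (the priority values are illustrative):

DOWNLOADER_MIDDLEWARES = {
    'mySpider.middlewares.RandomUserAgent': 543,
    'mySpider.middlewares.RandomProxy': 544,
}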
7.10.4 About Cookies
# settings.py
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
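COOKIES_ENABLED in settings.py applies project-wide; if only one crawler should run without cookies, Scrapy's custom_settings attribute can override it per spider (the spider below is a hypothetical example):

import scrapy

class NoCookieSpider(scrapy.Spider):
    name = 'no_cookies'
    # Overrides the project-wide value for this spider only
    custom_settings = {'COOKIES_ENABLED': False}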
7.10.5 Set Download Delay
# settings.py
# Configure a delay for requests to the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
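A fixed delay is easy for a site to detect; the related built-in settings below vary it instead (the values shown are illustrative):

# settings.py
# Wait between 0.5 * and 1.5 * DOWNLOAD_DELAY (this is on by default)
RANDOMIZE_DOWNLOAD_DELAY = True
# Or let the AutoThrottle extension adapt the delay to server load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60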