Anti-Crawling Techniques for Scrapy Crawlers


7.10 Scrapy's Anti-Crawling Methods

Set up downloader middleware (request, response):

1. Write the downloader middleware

2. Activate in settings:

DOWNLOADER_MIDDLEWARES = {
    'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
}

7.10.1 Writing the Middleware

process_request(request, spider)
This method is called for every request that passes through the downloader middleware.

process_request() must return one of the following: None, a Response object, or a Request object, or it must raise IgnoreRequest.

If it returns None, Scrapy continues processing the request, executing the corresponding methods of the other middleware, until the appropriate download handler is called and the request is performed (and its response downloaded).

If it returns a Response object, Scrapy will not call any other process_request() or process_exception() method, nor the corresponding download function; it returns that response directly. The process_response() methods of the installed middleware are still invoked on every returned response.

If it returns a Request object, Scrapy stops calling the remaining process_request() methods and reschedules the returned request. Once the newly returned request has been performed, the middleware chain is invoked on its downloaded response as usual.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middleware are invoked. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:
request (Request object) - the request being processed
spider (Spider object) - the spider this request belongs to
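
To make this contract concrete, here is a minimal sketch (the middleware name and block list are invented for illustration): returning None lets a request continue, while raising IgnoreRequest drops it.

from scrapy.exceptions import IgnoreRequest

class BlocklistMiddleware(object):
    # Hypothetical list of hosts we never want to fetch.
    BLOCKED_HOSTS = ('ads.example.com',)

    def process_request(self, request, spider):
        if any(host in request.url for host in self.BLOCKED_HOSTS):
            # Dropping the request: the process_exception() methods
            # of the installed middleware will be invoked.
            raise IgnoreRequest('blocked host in %s' % request.url)
        # None means: continue with the other middleware and
        # eventually download the request.
        return None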

process_response(request, response, spider)
process_response() must return one of the following: a Response object, a Request object, or it must raise an IgnoreRequest exception.

If it returns a Response (which can be the same as the incoming response or a completely new object), that response is processed by the process_response() methods of the other middleware in the chain.

If it returns a Request object, the middleware chain stops and the returned request is rescheduled for download; the handling is the same as when process_request() returns a Request.

If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:
request (Request object) - the request that produced the response
response (Response object) - the response being processed
spider (Spider object) - the spider this response belongs to
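
As a concrete example of returning a Request from process_response(), the sketch below (middleware name, status code, and retry limit are chosen for illustration) reschedules a download when the site answers with 403:

class Retry403Middleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403 and request.meta.get('retry_403', 0) < 3:
            # Returning a Request stops the chain and reschedules
            # the download; dont_filter=True keeps the duplicate
            # filter from discarding the retried URL.
            retry = request.replace(dont_filter=True)
            retry.meta['retry_403'] = request.meta.get('retry_403', 0) + 1
            return retry
        # Any other response continues down the middleware chain.
        return response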

process_exception(request, exception, spider)
Scrapy calls process_exception() when a download handler or a downloader middleware's process_request() raises an exception (including an IgnoreRequest exception).

process_exception() should return one of the following: None, a Response object, or a Request object.

If it returns None, Scrapy continues handling this exception, calling the process_exception() methods of the other installed middleware until all of them have been invoked; the default exception handling then kicks in.

If it returns a Response object, the process_response() method of the installed middleware chain is called. Scrapy will not call the process_exception() method of any other middleware.

If it returns a Request object, the returned request is rescheduled for download. This stops the execution of the remaining process_exception() methods, just as returning a Response does.

Parameters:
request (Request object) - the request that raised the exception
exception (Exception object) - the exception that was raised
spider (Spider object) - the spider this request belongs to
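
A typical use of process_exception() is to retry a failed download through a different proxy; a minimal sketch (it assumes the PROXIES list defined in section 7.10.3 below):

import random
from mySpider.settings import PROXIES

class ProxyRetryMiddleware(object):
    def process_exception(self, request, exception, spider):
        spider.logger.warning('download of %s failed: %r', request.url, exception)
        # Returning a Request reschedules the download, here with a
        # freshly chosen proxy; returning None instead would pass the
        # exception on to the other middleware.
        retry = request.replace(dont_filter=True)
        retry.meta['proxy'] = 'http://' + random.choice(PROXIES)['ip_port']
        return retry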

7.10.2 User-Agent Middleware

# middlewares.py
import random

from mySpider.settings import USER_AGENTS, PROXIES


class RandomUserAgent(object):
    def process_request(self, request, spider):
        # Attach a randomly chosen User-Agent to every request.
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault('User-Agent', useragent)
        # Returning None (implicitly) lets the request continue
        # through the remaining middleware.
        
# settings.py
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENTS = [
    # A pool of distinct, real browser User-Agent strings.
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
]
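
As with the example in the introduction, RandomUserAgent only runs once it is activated in settings.py. One caveat: Scrapy's built-in UserAgentMiddleware (priority 400) also fills in the User-Agent header via setdefault(), so the custom middleware should run before it (or the built-in one should be disabled), roughly like this:

DOWNLOADER_MIDDLEWARES = {
    # Run before the built-in UserAgentMiddleware (priority 400)
    # so our setdefault() wins.
    'mySpider.middlewares.RandomUserAgent': 390,
    # Alternatively, disable the built-in middleware entirely:
    # 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}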

7.10.3 Proxy Middleware

# Proxy middleware, appended to the same middlewares.py
# (random and PROXIES are already imported above)
import base64


class RandomProxy(object):
    def process_request(self, request, spider):
        # Route every request through a randomly chosen proxy.
        proxy = random.choice(PROXIES)
        if proxy.get('user_password') is None:
            # Anonymous proxy: no credentials needed.
            request.meta['proxy'] = 'http://' + proxy['ip_port']
        else:
            # Authenticated proxy: base64-encode "user:password".
            # Under Python 3, b64encode() takes and returns bytes.
            b64_userpassword = base64.b64encode(
                proxy['user_password'].encode('utf-8')).decode('ascii')
            # Note the space after "Basic".
            request.headers['Proxy-Authorization'] = 'Basic ' + b64_userpassword
            request.meta['proxy'] = 'http://' + proxy['ip_port']
            
            
# settings.py
# Proxy settings: keep a single PROXIES list; a second assignment
# would silently overwrite the first. Entries without credentials
# carry 'user_password': None.
PROXIES = [
    {'ip_port': '127.0.0.3:8000', 'user_password': None},
    {'ip_port': '127.0.0.3:8000', 'user_password': 'user1:1234567'},
    {'ip_port': '127.0.0.3:8000', 'user_password': 'user1:1234567'},
]
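
Both custom middlewares are activated together in settings.py; the priority values just have to be unique (a sketch, assuming both classes live in mySpider/middlewares.py as above):

DOWNLOADER_MIDDLEWARES = {
    'mySpider.middlewares.RandomUserAgent': 390,
    'mySpider.middlewares.RandomProxy': 391,
}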

7.10.4 About Cookies

# settings.py
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
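
Cookies can also be switched off for individual requests instead of globally, via the dont_merge_cookies meta key; a sketch (spider name and URL are illustrative):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # This single request bypasses cookie handling even when
        # COOKIES_ENABLED is True.
        yield scrapy.Request(response.url, dont_filter=True,
                             meta={'dont_merge_cookies': True})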

7.10.5 Setting a Download Delay

# settings.py
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
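
The fixed delay can be combined with randomization and automatic throttling; all of the options below are standard Scrapy settings:

# The actual wait is a random value between 0.5 * DOWNLOAD_DELAY
# and 1.5 * DOWNLOAD_DELAY (this is already the default).
RANDOMIZE_DOWNLOAD_DELAY = True

# AutoThrottle adapts the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60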
