Why do crawlers use proxy servers?
It can be summarized as follows:
1. When we use a Python crawler to scrape a website, we usually request it very frequently. The site's anti-crawler measures count how many requests come from a given IP within a certain time window, and if there are too many, that IP gets banned. By setting up several proxy servers and switching to a different one every so often, we avoid being blocked for visiting too frequently (a minimal sketch of how this works in scrapy follows this list).
2. Sometimes the network environment makes direct crawling too slow, while our connection to a proxy is fast and the proxy's connection to the target website is also fast, so going through the proxy improves crawling speed.
3. For legal or political reasons in some regions, certain websites cannot be accessed directly, and a proxy can be used to get around those access restrictions.
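In scrapy, the simplest way to assign a proxy to a single request is through the request meta: the built-in HttpProxyMiddleware reads the 'proxy' key from request.meta. The sketch below is only illustrative, not part of the spider developed later; RotatingSpider and PROXIES are made-up names and the proxy addresses are placeholders you would replace with real ones.

# Minimal sketch of per-request proxy rotation in scrapy.
# PROXIES is a hypothetical list of proxy URLs; scrapy's built-in
# HttpProxyMiddleware picks up the 'proxy' key set in request.meta.
import random
import scrapy

PROXIES = [
    'http://1.2.3.4:8080',    # placeholder proxy
    'http://5.6.7.8:3128',    # placeholder proxy
]

class RotatingSpider(scrapy.Spider):
    name = 'rotating_example'
    start_urls = ['http://httpbin.org/ip']

    def start_requests(self):
        for url in self.start_urls:
            # pick a different proxy for each request
            yield scrapy.Request(url, meta={'proxy': random.choice(PROXIES)})

    def parse(self, response):
        # httpbin.org/ip echoes back the IP it saw, so this logs the proxy's IP
        self.logger.info(response.text)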
Now let's use scrapy to crawl some usable high-anonymity proxies from xicidaili.com (Xici proxy).
# -*- coding: utf-8 -*-
import scrapy
import json


# Crawl the high-anonymity proxy list
class ProxySpider(scrapy.Spider):
    name = 'proxy'
    allowed_domains = ['www.xicidaili.com']
    start_urls = ['http://www.xicidaili.com/nn/%s' % i for i in range(1, 6)]

    def parse(self, response):
        # position() > 1 skips the first tr (the table header row)
        for sel in response.css('table#ip_list').xpath('.//tr[position()>1]'):
            # nth-child(2) selects the second child td (indexing starts at 1)
            ip = sel.css('td:nth-child(2)::text').extract_first()      # IP
            port = sel.css('td:nth-child(3)::text').extract_first()    # port
            scheme = sel.css('td:nth-child(6)::text').extract_first()  # type: HTTP or HTTPS
            # Build the proxy URL
            proxy = '%s://%s:%s' % (scheme, ip, port)
            # Request meta: route the verification request through this proxy
            meta = {
                'proxy': proxy,
                'dont_retry': True,       # download once only, do not retry on failure
                'download_timeout': 10,   # waiting time in seconds
                '_proxy_ip': ip,
                '_proxy_scheme': scheme,
            }
            # Check whether the proxy works by requesting httpbin.org/ip through it
            url = '%s://httpbin.org/ip' % scheme
            yield scrapy.Request(url, callback=self.check, meta=meta, dont_filter=True)

    def check(self, response):
        proxy_ip = response.meta['_proxy_ip']
        proxy_scheme = response.meta['_proxy_scheme']
        # json.loads() parses the JSON body; 'origin' is the IP httpbin saw.
        # If it matches the proxy's IP, the proxy really hides our own address.
        if json.loads(response.text)['origin'] == proxy_ip:
            yield {
                'proxy': response.meta['proxy'],
                'scheme': proxy_scheme,
            }
When running the crawler, export the scraped items to a JSON file for later use:
scrapy crawl proxy -o proxy_list.json
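Later, another spider can load that file and route its requests through the collected proxies. The sketch below is one possible way to consume it, not part of the original tutorial; UseProxySpider is an illustrative name and httpbin.org/ip is used only as a test URL.

# Sketch of reusing proxy_list.json in another spider (names are illustrative).
import json
import random
import scrapy

class UseProxySpider(scrapy.Spider):
    name = 'use_proxy'
    start_urls = ['http://httpbin.org/ip']

    def start_requests(self):
        # Load the proxies exported by the proxy spider above
        with open('proxy_list.json') as f:
            proxies = json.load(f)
        for url in self.start_urls:
            # Each item has the form {'proxy': '<scheme>://<ip>:<port>', 'scheme': ...}
            item = random.choice(proxies)
            yield scrapy.Request(url, meta={'proxy': item['proxy']})

    def parse(self, response):
        # Log the IP the target site saw, to confirm the proxy was used
        self.logger.info(response.text)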