Why do crawlers use proxy servers?
It can be summarized as follows:
1. When we use a Python crawler to scrape a website, we usually request it very frequently. The site's anti-crawler measures count how many requests come from a given IP within a certain time window, and if there are too many, that IP gets banned. By setting up several proxy servers and switching to a different one every so often, we avoid being blocked for visiting too frequently (a minimal sketch of how this works in scrapy follows this list).
2. Sometimes the network environment makes direct crawling too slow, while our connection to a proxy is fast and the proxy's connection to the target website is also fast, so going through the proxy improves crawling speed.
3. For legal or political reasons in some regions, certain websites cannot be accessed directly, and a proxy can be used to get around those access restrictions.
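In scrapy, the simplest way to assign a proxy to a single request is through the request meta: the built-in HttpProxyMiddleware reads the 'proxy' key from request.meta. The sketch below is only illustrative, not part of the spider developed later; RotatingSpider and PROXIES are made-up names and the proxy addresses are placeholders you would replace with real ones.

# Minimal sketch of per-request proxy rotation in scrapy.
# PROXIES is a hypothetical list of proxy URLs; scrapy's built-in
# HttpProxyMiddleware picks up the 'proxy' key set in request.meta.
import random
import scrapy

PROXIES = [
    'http://1.2.3.4:8080',    # placeholder proxy
    'http://5.6.7.8:3128',    # placeholder proxy
]

class RotatingSpider(scrapy.Spider):
    name = 'rotating_example'
    start_urls = ['http://httpbin.org/ip']

    def start_requests(self):
        for url in self.start_urls:
            # pick a different proxy for each request
            yield scrapy.Request(url, meta={'proxy': random.choice(PROXIES)})

    def parse(self, response):
        # httpbin.org/ip echoes back the IP it saw, so this logs the proxy's IP
        self.logger.info(response.text)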
Now let's use scrapy to crawl some usable high-anonymity proxies from xicidaili.com (Xici proxy).
# -*- coding: utf-8 -*-
import scrapy
import json


# Crawl the high-anonymity proxy list
class ProxySpider(scrapy.Spider):
    name = 'proxy'
    allowed_domains = ['www.xicidaili.com']
    start_urls = ['http://www.xicidaili.com/nn/%s' % i for i in range(1, 6)]

    def parse(self, response):
        # position() > 1 skips the first tr (the table header row)
        for sel in response.css('table#ip_list').xpath('.//tr[position()>1]'):
            # nth-child(2) selects the second child td (indexing starts at 1)
            ip = sel.css('td:nth-child(2)::text').extract_first()      # IP
            port = sel.css('td:nth-child(3)::text').extract_first()    # port
            scheme = sel.css('td:nth-child(6)::text').extract_first()  # type: HTTP or HTTPS
            # Build the proxy URL
            proxy = '%s://%s:%s' % (scheme, ip, port)
            # Request meta: route the verification request through this proxy
            meta = {
                'proxy': proxy,
                'dont_retry': True,       # download once only, do not retry on failure
                'download_timeout': 10,   # waiting time in seconds
                '_proxy_ip': ip,
                '_proxy_scheme': scheme,
            }
            # Check whether the proxy works by requesting httpbin.org/ip through it
            url = '%s://httpbin.org/ip' % scheme
            yield scrapy.Request(url, callback=self.check, meta=meta, dont_filter=True)

    def check(self, response):
        proxy_ip = response.meta['_proxy_ip']
        proxy_scheme = response.meta['_proxy_scheme']
        # json.loads() parses the JSON body; 'origin' is the IP httpbin saw.
        # If it matches the proxy's IP, the proxy really hides our own address.
        if json.loads(response.text)['origin'] == proxy_ip:
            yield {
                'proxy': response.meta['proxy'],
                'scheme': proxy_scheme,
            }
When running the crawler, export the scraped items to a JSON file for later use:
scrapy crawl proxy -o proxy_list.json
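Later, another spider can load that file and route its requests through the collected proxies. The sketch below is one possible way to consume it, not part of the original tutorial; UseProxySpider is an illustrative name and httpbin.org/ip is used only as a test URL.

# Sketch of reusing proxy_list.json in another spider (names are illustrative).
import json
import random
import scrapy

class UseProxySpider(scrapy.Spider):
    name = 'use_proxy'
    start_urls = ['http://httpbin.org/ip']

    def start_requests(self):
        # Load the proxies exported by the proxy spider above
        with open('proxy_list.json') as f:
            proxies = json.load(f)
        for url in self.start_urls:
            # Each item has the form {'proxy': '<scheme>://<ip>:<port>', 'scheme': ...}
            item = random.choice(proxies)
            yield scrapy.Request(url, meta={'proxy': item['proxy']})

    def parse(self, response):
        # Log the IP the target site saw, to confirm the proxy was used
        self.logger.info(response.text)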