An asynchronous proxy crawler and proxy pool implemented in Python

Keywords: Python Redis supervisor Session

An asynchronous proxy pool implemented with Python's asyncio. Free proxies are crawled from proxy websites according to per-site rules, checked for validity, and stored in Redis. The pool is topped up with new proxies at regular intervals, and the proxies already in it are re-checked so that invalid ones are removed. A server built with aiohttp lets other programs fetch proxies from the pool by requesting the corresponding URL.

Source code

Github

Environment

  • Python 3.5+
  • Redis
  • Phantom JS (optional)
  • Supervisord (optional)

Because the code makes heavy use of asyncio's async and await syntax, which was only introduced in Python 3.5, Python 3.5 or later is required; I use Python 3.6.

Dependencies

  • redis
  • aiohttp
  • bs4
  • lxml
  • requests
  • selenium

The selenium package is used mainly to drive PhantomJS.

The code is described below.

1. Crawler section

Core code

async def start(self):
    for rule in self._rules:
        parser = asyncio.ensure_future(self._parse_page(rule)) # Parse pages to get proxies according to rules
        logger.debug('{0} crawler started'.format(rule.__rule_name__))

        if not rule.use_phantomjs:
            await page_download(ProxyCrawler._url_generator(rule), self._pages, self._stop_flag) # crawl the proxy site's pages with aiohttp
        else:
            await page_download_phantomjs(ProxyCrawler._url_generator(rule), self._pages,
                                          rule.phantomjs_load_flag, self._stop_flag) # Crawling with Phantom JS

        await self._pages.join()

        parser.cancel()

        logger.debug('{0} crawler finished'.format(rule.__rule_name__))

The core code above is essentially a producer-consumer model implemented with asyncio.Queue. Below is a minimal implementation of that model:

import asyncio
from random import random


async def produce(queue, n):
    for x in range(1, n + 1):
        print('produce ', x)
        await asyncio.sleep(random())
        await queue.put(x) # put the item into the queue

async def consume(queue):
    while 1:
        item = await queue.get() # wait until an item is available in the queue
        print('consume ', item)
        await asyncio.sleep(random())
        queue.task_done() # notify the queue that the current item has been processed

async def run(n):
    queue = asyncio.Queue()
    consumer = asyncio.ensure_future(consume(queue))
    await produce(queue, n) # Waiting for the end of the producer
    await queue.join()  # block until all items in the queue have been processed
    consumer.cancel() # cancel the consumer task, otherwise it would block on get() forever


def aio_queue_run(n):
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(run(n)) # Continue running event loop until task run(n) ends
    finally:
        loop.close()

if __name__ == '__main__':
    aio_queue_run(5)

Running the above code, one possible output is as follows:

produce  1
produce  2
consume  1
produce  3
produce  4
consume  2
produce  5
consume  3
consume  4
consume  5

Downloading pages

async def page_download(urls, pages, flag):
    url_generator = urls
    async with aiohttp.ClientSession() as session:
        for url in url_generator:
            if flag.is_set():
                break

            await asyncio.sleep(uniform(delay - 0.5, delay + 1))
            logger.debug('crawling proxy web page {0}'.format(url))
            try:
                async with session.get(url, headers=headers, timeout=10) as response:
                    page = await response.text()
                    parsed = html.fromstring(decode_html(page)) # use bs4 to help lxml decode the page, see http://lxml.de/elementsoup.html ("Using only the encoding detection")
                    await pages.put(parsed)
                    url_generator.send(parsed) # Get the address of the next page based on the current page
            except StopIteration:
                break
            except asyncio.TimeoutError:
                logger.error('crawling {0} timeout'.format(url))
                continue # TODO: use a proxy
            except Exception as e:
                logger.error(e)

Most proxy websites can be crawled with the aiohttp-based function above. For sites that generate their pages dynamically with JavaScript, selenium can drive PhantomJS to do the crawling. This project does not need a high-throughput crawler: proxy sites update at a limited rate and do not have to be crawled frequently, so PhantomJS is perfectly adequate.
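
For reference, the sketch below shows one way such a PhantomJS download coroutine could look. It is only an illustration based on the behaviour described above, not the project's actual page_download_phantomjs: it loads each page in PhantomJS via selenium, waits until the rule's phantomjs_load_flag marker appears in the page source (i.e. the JavaScript-generated content has loaded), then parses the HTML with lxml and puts it on the pages queue.

import asyncio
from random import uniform

from lxml import html
from selenium import webdriver


async def page_download_phantomjs(urls, pages, load_flag, flag, delay=2):
    # Hypothetical sketch; the project's real implementation may differ.
    driver = webdriver.PhantomJS()  # deprecated in newer selenium releases, but it works here
    try:
        for url in urls:
            if flag.is_set():  # stop event set by the proxy pool
                break
            await asyncio.sleep(uniform(delay - 0.5, delay + 1))
            driver.get(url)  # blocking call; acceptable since crawl speed is not critical
            for _ in range(10):
                if load_flag in driver.page_source:  # the marker element has appeared
                    break
                await asyncio.sleep(1)  # JS content not ready yet, wait a bit longer
            parsed = html.fromstring(driver.page_source)
            await pages.put(parsed)
    finally:
        driver.quit()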

Parsing proxies

The simplest way to extract the proxies is with XPath. If you use Chrome, you can copy the XPath of a selected page element directly from the right-click menu.

Installing the Chrome extension "XPath Helper" lets you run and debug XPath expressions directly on the page, which is very convenient.

BeautifulSoup does not support XPath, so lxml is used to parse the page. The code is as follows:

async def _parse_proxy(self, rule, page):
    ips = page.xpath(rule.ip_xpath) # parse the list of IP addresses according to the rule's xpath
    ports = page.xpath(rule.port_xpath) # parse the list of ports according to the rule's xpath

    if not ips or not ports:
        logger.warning('{2} crawler could not get ip(len={0}) or port(len={1}), please check the xpaths or network'.
              format(len(ips), len(ports), rule.__rule_name__))
        return

    proxies = map(lambda x, y: '{0}:{1}'.format(x.text.strip(), y.text.strip()), ips, ports)

    if rule.filters: # filter proxies by fields such as anonymity level ("high anonymity", "transparent")
        filters = []
        for i, ft in enumerate(rule.filters_xpath):
            field = page.xpath(ft)
            if not field:
                logger.warning('{1} crawler could not get {0} field, please check the filter xpath'.
                      format(rule.filters[i], rule.__rule_name__))
                continue
            filters.append(map(lambda x: x.text.strip(), field))

        filters = zip(*filters)
        selector = map(lambda x: x == rule.filters, filters)
        proxies = compress(proxies, selector)

    for proxy in proxies:
        await self._proxies.put(proxy) # The parsed proxy is put into asyncio.Queue
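
The filtering step above combines map, zip and itertools.compress: the extracted filter fields are zipped into one tuple per proxy, each tuple is compared with rule.filters, and compress keeps only the proxies whose tuple matches. A standalone snippet with made-up data shows the idea:

from itertools import compress

# hypothetical field values extracted by two filter xpaths, one value per proxy
anonymity = ['high', 'transparent', 'high']
protocol = ['HTTP', 'HTTP', 'HTTPS']
proxies = ['1.1.1.1:80', '2.2.2.2:8080', '3.3.3.3:3128']

filters = ('high', 'HTTP')                      # like rule.filters: the values we want to keep
fields = zip(anonymity, protocol)               # one tuple of field values per proxy
selector = map(lambda x: x == filters, fields)  # True where the tuple matches the filters
print(list(compress(proxies, selector)))        # ['1.1.1.1:80']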

Crawler rules

The rules for crawling a website, parsing its proxies and filtering them are defined in a rule class for each proxy site; a metaclass and a base class are used to manage these rule classes. The base class is defined as follows:

class CrawlerRuleBase(object, metaclass=CrawlerRuleMeta):

    start_url = None
    page_count = 0
    urls_format = None
    next_page_xpath = None
    next_page_host = ''

    use_phantomjs = False
    phantomjs_load_flag = None

    filters = ()

    ip_xpath = None
    port_xpath = None
    filters_xpath = ()

The meaning of each parameter is as follows:

  • start_url (required)
    The crawler's start page.
  • ip_xpath (required)
    XPath rule for extracting IP addresses.
  • port_xpath (required)
    XPath rule for extracting port numbers.
  • page_count
    Number of pages to crawl.
  • urls_format
    Format string for page addresses; the address of page n is generated with urls_format.format(start_url, n), which covers the most common paging scheme.
  • next_page_xpath, next_page_host
    The URL of the next page (usually a relative path) is extracted with next_page_xpath, and the full address of the next page is next_page_host + url.
  • use_phantomjs, phantomjs_load_flag
    use_phantomjs marks whether PhantomJS is needed to crawl the site. If it is, phantomjs_load_flag (a string contained in some element of the page) must be defined as the marker that the PhantomJS-rendered page has finished loading.
  • filters
    An iterable of filter field values used to filter the proxies, e.g. by anonymity level.
  • filters_xpath
    XPath rules for extracting the filter fields; they correspond one-to-one, in order, to the values in filters.

The metaclass CrawlerRuleMeta manages the definition of rule classes: for example, if a rule class defines use_phantomjs=True, it must also define phantomjs_load_flag, otherwise an exception is raised at class-definition time. The details are not repeated here.
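
A rough sketch of what such a metaclass check could look like (an illustration only, not the project's actual CrawlerRuleMeta):

class CrawlerRuleMeta(type):
    """Validate rule class definitions at class-creation time (sketch)."""

    def __new__(cls, name, bases, attrs):
        if attrs.get('use_phantomjs') and not attrs.get('phantomjs_load_flag'):
            raise ValueError('{0}: phantomjs_load_flag must be defined '
                             'when use_phantomjs is True'.format(name))
        # record a rule name so crawlers can log it as rule.__rule_name__ (assumed detail)
        attrs.setdefault('__rule_name__', name.lower())
        return super().__new__(cls, name, bases, attrs)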
Rules have already been implemented for the Xici, Kuaidaili, 360, 66 and Mimi free-proxy sites. Adding a new rule class is simple: define a class YourRuleClass that inherits from CrawlerRuleBase, place it in the proxypool/rules directory, and add from . import YourRuleClass to that directory's __init__.py (so that all rule classes can be retrieved through CrawlerRuleBase.__subclasses__()), then restart the running proxy pool to apply the new rule. An example rule class is sketched below.
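
As an illustration (the site name, URLs and XPaths below are made up, not one of the project's real rules), a new rule class might look like this:

# proxypool/rules/example_site.py -- hypothetical example
from . import CrawlerRuleBase  # adjust the import to wherever CrawlerRuleBase lives in the project


class ExampleProxyRule(CrawlerRuleBase):
    start_url = 'http://www.example-proxy-site.com/free/'
    page_count = 5
    urls_format = '{0}page/{1}'        # page n -> http://www.example-proxy-site.com/free/page/n

    ip_xpath = '//table//tr/td[1]'     # cells holding the IP addresses
    port_xpath = '//table//tr/td[2]'   # cells holding the port numbers

    filters = ('high anonymity',)      # keep only high-anonymity proxies
    filters_xpath = ('//table//tr/td[3]',)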

2. Validation section

Although there are plenty of free proxies, only a small fraction of them actually work, so newly crawled proxies must be validated before they enter the pool. Proxies also go stale, so those already in the pool must be re-checked regularly and invalid ones removed promptly.
This part is straightforward: use aiohttp to request a website through the proxy; if the request times out or fails, the proxy is considered invalid.

async def validate(self, proxies):
    logger.debug('validator started')
    while 1:
        proxy = await proxies.get()
        async with aiohttp.ClientSession() as session:
            try:
                real_proxy = 'http://' + proxy
                async with session.get(self.validate_url, proxy=real_proxy, timeout=validate_timeout) as resp:
                    self._conn.put(proxy) # the request succeeded, so put the proxy into the pool (redis)
            except Exception as e:
                logger.error(e)

        proxies.task_done()

3. Server section

A web server is implemented using aiohttp. After startup, the home page can be displayed by visiting http://host:port:

  • Visit http://host:port/get to get one proxy from the pool, e.g. '127.0.0.1:1080';
  • Visit http://host:port/get/n to get n proxies from the pool, e.g. "['127.0.0.1:1080', '127.0.0.1:443', '127.0.0.1:80']";
  • Visit http://host:port/count to get the size of the proxy pool, e.g. '42'.
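
For example, a client program could fetch and use a proxy like this (a sketch; the server's host and port below are placeholders, and the requests library is already listed in the dependencies):

import requests

base = 'http://127.0.0.1:8080'                          # placeholder host:port of the proxy pool server

proxy = requests.get(base + '/get').text                # e.g. '127.0.0.1:1080'
count = int(requests.get(base + '/count').text)         # current size of the pool
proxy_list = eval(requests.get(base + '/get/10').text)  # the server returns a list literal as text

# use the proxy for an ordinary HTTP request
resp = requests.get('http://httpbin.org/ip',
                    proxies={'http': 'http://' + proxy}, timeout=10)
print(resp.text, count, len(proxy_list))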

Because the home page is a static HTML file, it is cached in redis to avoid opening, reading and closing the file on every request. The file's modification time is used to decide whether it has changed: if it differs from the mtime stored alongside the redis cache, the file is re-read and the cache updated; otherwise the page content is served straight from redis.
Proxies are returned with aiohttp.web.Response(text=ip.decode('utf-8')). The text argument requires a str, while redis returns bytes, so the value has to be decoded; a response containing multiple proxies can be converted back to a list with eval on the client side.
The home page is returned differently, with aiohttp.web.Response(body=main_page_cache, content_type='text/html'). Here body requires bytes, so the cache fetched from redis can be returned directly, and content_type='text/html' is essential: without it the browser downloads the page instead of rendering it (note that the sample code in the official documentation mostly does not set content_type).
This part is not complicated; just mind the points above. For serving the static resource files referenced by the home page, see the earlier post "Adding static resource paths to aiohttp".
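
Putting those points together, a simplified sketch of the two handlers could look like the following. It is only an illustration of the ideas above, not the project's exact code; the redis key names, the app['redis'] / app['proxy_pool'] lookups and the file path are assumptions:

import os

from aiohttp import web

INDEX_HTML = '/path/to/ProxyPool/index.html'  # placeholder path of the home page file


async def index(request):
    conn = request.app['redis']               # assumed: a redis connection stored on the app
    mtime = str(os.path.getmtime(INDEX_HTML))
    if conn.get('index_html_mtime') != mtime.encode('utf-8'):
        with open(INDEX_HTML, 'rb') as f:     # file changed (or first request): refresh the cache
            conn.set('index_html', f.read())
        conn.set('index_html_mtime', mtime)
    page = conn.get('index_html')             # bytes fetched from redis
    # content_type='text/html' is required, otherwise the browser downloads the page
    return web.Response(body=page, content_type='text/html')


async def get_proxy(request):
    ip = request.app['proxy_pool'].get()      # assumed helper returning one proxy as bytes
    return web.Response(text=ip.decode('utf-8'))  # text= needs str, so decode the redis bytes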

4. Operation

The proxy pool is split into three independent parts:

  • proxypool
    Periodically checks the size of the proxy pool. If it falls below the lower limit, the proxy crawler is started; the crawled proxies are validated, the valid ones are put into the pool, and the crawler is stopped once the required number has been reached.
  • proxyvalidator
    Periodically re-checks the proxies in the pool and removes invalid ones.
  • proxyserver
    Starts the server.

These three independent tasks run as three processes, which can be managed with supervisord under Linux. Here is an example supervisord configuration file:

; supervisord.conf
[unix_http_server]
file=/tmp/supervisor.sock   

[inet_http_server]         
port=127.0.0.1:9001       

[supervisord]
logfile=/tmp/supervisord.log 
logfile_maxbytes=5MB       
logfile_backups=10          
loglevel=debug                
pidfile=/tmp/supervisord.pid 
nodaemon=false              
minfds=1024                  
minprocs=200                 

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock

[program:proxyPool]
command=python /path/to/ProxyPool/run_proxypool.py            
redirect_stderr=true
stdout_logfile=NONE

[program:proxyValidator]
command=python /path/to/ProxyPool/run_proxyvalidator.py
redirect_stderr=true          
stdout_logfile=NONE


[program:proxyServer]
command=python /path/to/ProxyPool/run_proxyserver.py
autostart=false
redirect_stderr=true          
stdout_logfile=NONE

Because the project has its own logging configured, there is no need to capture stdout and stderr with supervisor. After starting supervisord with supervisord -c supervisord.conf, proxyPool and proxyValidator start automatically, while proxyServer (configured with autostart=false) has to be started manually. Visiting http://127.0.0.1:9001 lets you manage the three processes through the web interface.

According to the official supervisord documentation, the current version (3.3.1) does not support Python 3, but I have not run into any problems using it, perhaps because I only use supervisor's simple features for monitoring process status and starting or stopping processes.
