Preface
One of the most common anti-crawler measures is checking request frequency. If you send a large number of requests in a short time, the site will block your account or IP for a while, whether you are human or not. To keep crawling anyway, we can route requests through proxy IPs to disguise ourselves. Below, we use multithreading to crawl free proxy IPs and verify that they work.
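As a quick illustration of the disguise itself, here is a minimal sketch of routing a request through a proxy with `requests` (the proxy address is a placeholder, not a working proxy):

```python
import requests

# Placeholder proxy address; substitute one that is actually alive.
proxies = {'http': 'http://1.2.3.4:8080'}

# requests routes the call through the proxy, so the target site
# sees the proxy's IP instead of ours.
resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
print(resp.text)
```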
Analysis
The free proxy IPs come from the Xici proxy site (xicidaili.com).
Looking at the page source, I chose to extract the fields directly with XPath. Compared with BeautifulSoup, XPath is not only faster but also more concise, needing less code.
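To make the comparison concrete, here is a minimal sketch of both approaches on a toy fragment shaped like the proxy table (the row layout is my assumption, mirroring the XPath expression used later):

```python
from bs4 import BeautifulSoup
from lxml import etree

# Toy fragment mimicking the proxy table's row layout.
html_doc = '<table><tr class=""><td>CN</td><td>1.2.3.4</td><td>8080</td></tr></table>'

# XPath: a single expression reaches every IP cell.
ips = etree.HTML(html_doc).xpath('//tr[@class=""]/td[2]/text()')
print(ips)  # ['1.2.3.4']

# BeautifulSoup: the same extraction needs a find/loop combination.
soup = BeautifulSoup(html_doc, 'lxml')
ips_bs = [row.find_all('td')[1].get_text() for row in soup.find_all('tr')]
print(ips_bs)  # ['1.2.3.4']
```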
Code
Grab Code:
```python
def get_info(task_queue, flag):
    # Pull list pages until the queue is drained, harvesting ip:port pairs.
    while not task_queue.empty():
        url = task_queue.get()
        txt = requests.get(url, headers=headers).text
        html = etree.HTML(txt)
        # The IP sits in the 2nd cell of each data row, the port in the 3rd.
        ips = html.xpath('//tr[@class=""]/td[2]/text()')
        ports = html.xpath('//tr[@class=""]/td[3]/text()')
        for ip, port in zip(ips, ports):
            Queue3.put(['%s:%s' % (ip, port), flag])
        yz(Queue3)
```
Verification Code:
```python
def yz(result_queue):
    # Verify each harvested proxy by fetching Baidu through it.
    while not result_queue.empty():
        addr, flag = result_queue.get()
        # requests expects a scheme-prefixed proxy URL, e.g. {'http': 'http://1.2.3.4:8080'}
        proxies = {flag: '%s://%s' % (flag, addr)}
        # Test HTTP proxies against an http:// URL and HTTPS proxies against https://.
        test_url = 'http://www.baidu.com' if flag == 'http' else 'https://www.baidu.com'
        try:
            response = requests.get(test_url, proxies=proxies, timeout=2)
            if response.status_code == 200:
                print(flag, addr, 'yes')
            else:
                print(flag, addr, 'no')
        except Exception as e:
            print(e)
```
Full code:
```python
import queue
import threading

import requests
from lxml import etree

Queue1 = queue.Queue(23)     # pages that list HTTP proxies
Queue2 = queue.Queue(18)     # pages that list HTTPS proxies
Queue3 = queue.Queue(10000)  # harvested proxies awaiting verification

# Queue the proxy-list pages up front so several threads can share the work.
for i in range(1, 10):
    Queue1.put("https://www.xicidaili.com/wt/%d" % i)
for i in range(1, 10):
    Queue2.put("https://www.xicidaili.com/wn/%d" % i)

headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) '
                  'AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
}

def yz(result_queue):
    # Verify each harvested proxy by fetching Baidu through it.
    while not result_queue.empty():
        addr, flag = result_queue.get()
        proxies = {flag: '%s://%s' % (flag, addr)}
        test_url = 'http://www.baidu.com' if flag == 'http' else 'https://www.baidu.com'
        try:
            response = requests.get(test_url, proxies=proxies, timeout=2)
            if response.status_code == 200:
                print(flag, addr, 'yes')
            else:
                print(flag, addr, 'no')
        except Exception as e:
            print(e)

def get_info(task_queue, flag):
    # Pull list pages until the queue is drained, harvesting ip:port pairs.
    while not task_queue.empty():
        url = task_queue.get()
        txt = requests.get(url, headers=headers).text
        html = etree.HTML(txt)
        ips = html.xpath('//tr[@class=""]/td[2]/text()')
        ports = html.xpath('//tr[@class=""]/td[3]/text()')
        for ip, port in zip(ips, ports):
            Queue3.put(['%s:%s' % (ip, port), flag])
        yz(Queue3)

if __name__ == '__main__':
    for i in range(3):
        th = threading.Thread(target=get_info, args=[Queue1, 'http'])
        th.start()
    for i in range(3):
        td = threading.Thread(target=get_info, args=[Queue2, 'https'])
        td.start()
```
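The full script fires threads and forgets them, and the verified proxies only go to stdout. A small variation, sketched below with hypothetical names (`good_proxies`, `yz_collect`), would store the live proxies and wait for every worker to finish:

```python
good_proxies = []        # (flag, 'ip:port') pairs that passed verification
lock = threading.Lock()  # guard the shared list across worker threads

def yz_collect(result_queue):
    # Same logic as yz(), but stores survivors instead of printing them.
    while not result_queue.empty():
        addr, flag = result_queue.get()
        proxies = {flag: '%s://%s' % (flag, addr)}
        test_url = 'http://www.baidu.com' if flag == 'http' else 'https://www.baidu.com'
        try:
            if requests.get(test_url, proxies=proxies, timeout=2).status_code == 200:
                with lock:
                    good_proxies.append((flag, addr))
        except Exception:
            pass  # dead proxy; skip it

# Have get_info() call yz_collect(Queue3) instead of yz(Queue3), then:
threads = [threading.Thread(target=get_info, args=[q, f])
           for q, f in [(Queue1, 'http'), (Queue2, 'https')] for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every worker before reading the results
print(good_proxies)
```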
Effect screenshots
(Screenshots omitted; each verified proxy prints one line of the form `http <ip:port> yes` or `no`.)
Reflection and summary
- The purpose of a crawler is to collect useful information. Don't crawl what you don't need; skipping it saves time and improves efficiency.
- When writing a crawler, plan for common anti-crawler strategies up front; it saves rework later when the site starts blocking you (see the sketch after this list).
- Restrain yourself and analyze the site first. This site lists tens of thousands of proxies, and I made the mistake of grabbing them all. On closer inspection, many entries were last verified back in 2016; only the most recently verified proxies are worth fetching. It's not that the old ones are certainly useless, just that grabbing them is unnecessary.
- Besides scraping free proxies, you can also buy IPs, though people online say purchased IPs aren't always stable, and free IPs can fail at any time. Some providers also expose an API that returns proxies automatically, but that approach has its own drawbacks.
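For the anti-crawler point above, one common combination is a randomized delay plus a rotating User-Agent. Here is a minimal sketch under my own names (`USER_AGENTS`, `polite_get` are illustrative, and the UA strings are sample values):

```python
import random
import time

import requests

# A few User-Agent strings to rotate through (sample values).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)',
]

def polite_get(url):
    # A randomized delay keeps the request rate from looking machine-like.
    time.sleep(random.uniform(1, 3))
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=5)
```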