I've written before about building a proxy IP pool with a single thread, but you quickly find that testing proxy IPs one by one in a single thread is far too slow: a full run takes a long time and is simply unbearable. This article therefore builds the IP pool with multithreading, which is much faster than the single-threaded version. The reason for choosing multithreading over multiprocessing is that most of the test time is spent waiting for data to travel over the network, while the local computation is very short; multithreading makes better use of a single core, and threads are much cheaper to create than processes. Of course, single-core performance has its limits, and if you need more throughput you have to combine multiprocessing with multithreading. All of this assumes CPython as the interpreter, which is what most people use, so the discussion below is based on it.
Limited by my own knowledge, my understanding of multiprocessing and multithreading is not very deep; if I get the chance I will write a dedicated article on concurrent programming later. CPython cannot use multiple cores from threads because of the GIL, but it can use multiprocessing to do so. Note that the GIL is not a feature of the Python language itself; it is an implementation detail of the CPython interpreter. Any Python thread must acquire the GIL before it can execute, and (in older CPython versions) the interpreter automatically releases the GIL after roughly every 100 bytecode instructions so that other threads get a chance to run. Python threads therefore execute alternately: even if multiple threads run on a multi-core CPU, they can only use one core at a time.
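To see the effect of the GIL concretely, here is a minimal standalone sketch (not part of the proxy-pool code) that times a CPU-bound loop run sequentially and then in two threads. Under CPython the threaded version is not faster, which is exactly why threads only pay off for I/O-bound work such as waiting on the network.

```python
# Minimal GIL demonstration, independent of the proxy-pool code.
# Under CPython, two threads running a CPU-bound loop take about as long
# as running the same work sequentially, because only one thread can hold
# the GIL at a time.
import threading
import time


def count(n):
    while n > 0:
        n -= 1


N = 10 ** 7

start = time.time()
count(N)
count(N)
print('sequential:  %.2fs' % (time.time() - start))

start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print('two threads: %.2fs' % (time.time() - start))
```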
In fact, the main body of the program was already written in the previous article; all we need are a few small changes to make it fit multithreaded programming. My idea is to dedicate one thread to crawling the IPs to be tested, while the other threads take IPs from that pool and test them. This is essentially a producer-consumer setup, the same idea that underlies much distributed programming.
We first set up a queue to store the IPs waiting to be tested, together with a lock that will protect writes to the result file.
```python
import threading
from queue import Queue   # Python 3; on Python 2 use "from Queue import Queue"

thread_lock = threading.Lock()   # protects writes to the result file
test_ip_list = Queue()           # IPs waiting to be tested
```
Then we make a few modifications to the functions from the previous article.
```python
import re
import time

import requests

import hidden_reptile   # the author's own module for generating random request headers


def download_page(url, timeout=10):
    headers = hidden_reptile.random_header()
    data = requests.get(url, headers=headers, timeout=timeout)
    return data


def test_ip(test_url):
    """Take IPs from the queue and test them until the queue is empty."""
    while True:
        if test_ip_list.empty():
            return
        ip = test_ip_list.get()
        proxies = {
            'http': ip[0] + ':' + ip[1],
            'https': ip[0] + ':' + ip[1]
        }
        try_ip = ip[0]
        try:
            r = requests.get(test_url, headers=hidden_reptile.random_header(),
                             proxies=proxies, timeout=10)
            if r.status_code == 200:
                r.encoding = 'gbk'
                # The test page echoes back the IP that made the request
                result = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', r.text)
                result = result.group()
                if result[:9] == try_ip[:9]:
                    print('%s:%s test passed' % (ip[0], ip[1]))
                    thread_lock.acquire()
                    with open('proxy_ip.txt', 'a') as f:
                        f.write(ip[0] + ':' + ip[1] + '\n')
                    thread_lock.release()
                else:
                    print('%s:%s proxy was not used, the local IP was exposed' % (ip[0], ip[1]))
            else:
                print('%s:%s status code is not 200' % (ip[0], ip[1]))
        except Exception as e:
            print(e)
            print('%s:%s error' % (ip[0], ip[1]))


def get_proxies(page_num, ip_url_list):
    """Crawl the proxy-list pages and put the candidate IPs into the queue."""
    for ip_url in ip_url_list:
        for page in range(1, page_num + 1):
            print('Fetching page %d of proxy IPs' % page)
            url = ip_url.format(page)
            r = download_page(url)
            r.encoding = 'utf-8'
            pattern = re.compile('<td class="country">.*?alt="Cn" />.*?</td>.*?'
                                 '<td>(.*?)</td>.*?<td>(.*?)</td>', re.S)
            ip_list = re.findall(pattern, r.text)
            for ip in ip_list:
                test_ip_list.put(ip)
            time.sleep(10)
        print('{} finished'.format(ip_url))
```
Note that when writing to the file we need a thread lock, because all threads append to the same file. Without the lock, one thread could be interrupted halfway through its write and another thread could write its own data in between, corrupting the output. All the IPs to be tested come from the Python queue test_ip_list, and its put/get operations need no extra locking because queue.Queue is already thread-safe internally.
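As a side note, the acquire()/release() pair above can also be written with a with statement, which releases the lock even if the write raises an exception. Here is a minimal standalone sketch of that pattern; the names lock and save_line are only illustrative and are not part of the proxy-pool code.

```python
# Minimal sketch of the file-writing lock pattern, independent of the
# proxy-pool code above. The names "lock" and "save_line" are illustrative.
import threading

lock = threading.Lock()

def save_line(line, path='proxy_ip.txt'):
    # "with lock:" acquires the lock and guarantees it is released,
    # even if the file write raises an exception.
    with lock:
        with open(path, 'a') as f:
            f.write(line + '\n')
```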
Finally, we write the part that runs everything.
```python
if __name__ == '__main__':
    number_of_threads = 8
    total_pages = 20
    threads = []
    url = ["http://www.xicidaili.com/nt/{}"]
    test_url = 'http://ip.tool.chinaz.com/'

    # One thread crawls the proxy-list pages and fills the queue
    t = threading.Thread(target=get_proxies, args=(total_pages, url))
    t.setDaemon(True)
    t.start()
    threads.append(t)

    # Give the crawler a head start so the queue is not empty
    time.sleep(1)

    # The remaining threads take IPs from the queue and test them
    for i in range(1, number_of_threads):
        t = threading.Thread(target=test_ip, args=(test_url,))
        t.setDaemon(True)
        threads.append(t)
        t.start()

    for thread in threads:
        thread.join()
```
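For comparison, the same structure can also be expressed with concurrent.futures instead of managing Thread objects by hand. The following is only a sketch of an alternative, not the code used in this article; it assumes the get_proxies and test_ip functions defined above.

```python
# Alternative sketch using a thread pool instead of manually managed threads.
# Assumes the get_proxies() and test_ip() functions defined above; this is
# not the version used in the article.
import time
from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    number_of_threads = 8
    total_pages = 20
    url = ["http://www.xicidaili.com/nt/{}"]
    test_url = 'http://ip.tool.chinaz.com/'

    with ThreadPoolExecutor(max_workers=number_of_threads) as pool:
        pool.submit(get_proxies, total_pages, url)    # one crawler task
        time.sleep(1)                                 # let the queue fill a little
        for _ in range(number_of_threads - 1):        # the rest test IPs
            pool.submit(test_ip, test_url)
        # leaving the with-block waits for all submitted tasks to finish
```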
Other proxy-listing sites can be added to the url list, and total_pages is the number of pages to crawl from each site. We pause for one second after starting the first thread so that it has time to put some IPs into the queue before the test threads start; otherwise they would see an empty queue and exit immediately.
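For example, adding a second source would just mean appending another URL template. The site below is a placeholder, not a real source, and since the parsing regex in get_proxies() is written for xicidaili's HTML, a new site would also need its own pattern.

```python
# Hypothetical example of adding a second proxy-listing site.
# "example-proxy-list.com" is a placeholder, and the regex in get_proxies()
# would have to be adapted to that site's HTML.
url = [
    "http://www.xicidaili.com/nt/{}",
    "http://www.example-proxy-list.com/free/{}",
]
```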