I. Preface
Often, after we finish writing a crawler and it meets the requirements, we find plenty of things worth improving, and one of the most important is crawling speed. This article explains, through code, how to use multiprocessing, multithreading, and coroutines to speed up crawling. Note: we will not go into theory or principles; everything is in the code.
II. Synchronization
First, let's write a simplified crawler, splitting out each function and consciously writing it in a functional style. The purpose of the following code is to visit the Baidu page 300 times and print the status code. The parse_1 function sets the number of loops, and each loop passes the url into the parse_2 function.
import requests

def parse_1():
    url = 'https://www.baidu.com'
    for i in range(300):
        parse_2(url)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
The main performance cost lies in the IO of the requests: when URLs are requested in single-process, single-threaded mode, waiting is inevitable.
The sample code is typical serial logic: parse_1 passes the url to parse_2, parse_2 makes the request and prints the status code, and then parse_1 moves on to the next iteration, repeating the previous steps.
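As a side note, if you want to compare how long this synchronous version (or any of the later variants) takes, one simple option is to wrap the call with the time module. The snippet below is only an illustration, not part of the original code, and it assumes parse_1 is defined as above:

import time

if __name__ == '__main__':
    start = time.time()
    parse_1()  # whichever variant is being tested
    # Total wall-clock time for the 300 requests
    print('elapsed: {:.2f} s'.format(time.time() - start))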
III. Multithreading
Because the CPU runs only one thread at any given instant while executing a program, multithreading actually raises the utilization of the process and, in turn, of the CPU.
There are many libraries that implement multithreading. Here we use ThreadPoolExecutor from concurrent.futures to demonstrate, because its code is simpler than that of the other libraries.
To make the explanation easy to follow, any newly added lines in the code below are prefixed with a > symbol so they are easy to spot; remove these markers before actually running the code.
import requests
> from concurrent.futures import ThreadPoolExecutor

def parse_1():
    url = 'https://www.baidu.com'
    # Set up the thread pool
    > pool = ThreadPoolExecutor(6)
    for i in range(300):
        > pool.submit(parse_2, url)
    > pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
The opposite of synchronous is asynchronous. Asynchronous means the tasks are independent of one another: while waiting for an event, each continues doing its own work instead of waiting for the event to complete. Threads are one way to achieve asynchrony. In other words, because multithreading is asynchronous, we don't immediately know the result of the processing; when we do need the result, we can use a callback.
import requests
from concurrent.futures import ThreadPoolExecutor

# Add a callback function
> def callback(future):
>     print(future.result())

def parse_1():
    url = 'https://www.baidu.com'
    pool = ThreadPoolExecutor(6)
    for i in range(300):
        > results = pool.submit(parse_2, url)
        # Key step: register the callback
        > results.add_done_callback(callback)
    pool.shutdown(wait=True)

# Note: parse_2 returns nothing, so the callback prints None;
# return response.status_code here if the callback should receive it
def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
Python's multithreading is much criticized because of the GIL (Global Interpreter Lock), but multithreading is still very suitable for crawling web pages, which are mostly IO-intensive tasks.
IV. Multiprocessing
Multiprocessing is implemented in two ways: ProcessPoolExecutor and multiprocessing
1. ProcessPoolExecutor
Its usage is similar to that of ThreadPoolExecutor, which implements multithreading.
import requests
> from concurrent.futures import ProcessPoolExecutor

def parse_1():
    url = 'https://www.baidu.com'
    # Set up the process pool
    > pool = ProcessPoolExecutor(6)
    for i in range(300):
        > pool.submit(parse_2, url)
    > pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
You can see that only the class name changes (in two places), and the code is still very simple. Similarly, you can add a callback function.
import requests
from concurrent.futures import ProcessPoolExecutor

> def callback(future):
>     print(future.result())

def parse_1():
    url = 'https://www.baidu.com'
    pool = ProcessPoolExecutor(6)
    for i in range(300):
        > results = pool.submit(parse_2, url)
        > results.add_done_callback(callback)
    pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
2. multiprocessing
Look directly at the code. Everything is in the comments.
import requests
> from multiprocessing import Pool

def parse_1():
    url = 'https://www.baidu.com'
    # Build the process pool
    > pool = Pool(processes=5)
    # List to store the pending results
    > res_lst = []
    for i in range(300):
        # Add a task to the pool
        > res = pool.apply_async(func=parse_2, args=(url,))
        # Keep the result handle; it has to be fetched later
        > res_lst.append(res)
    # Store the final results (or store / print them directly)
    > good_res_lst = []
    > for res in res_lst:
        # Use get() to fetch the finished result
        > good_res = res.get()
        # Check the result
        > if good_res:
            > good_res_lst.append(good_res)
    # Close the pool and wait for completion
    > pool.close()
    > pool.join()

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
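As an aside (this is not part of the original example), when per-task control over the results is not needed, the apply_async loop can usually be replaced with Pool.map, which blocks until every task has finished. A minimal sketch, reusing the same parse_2:

import requests
from multiprocessing import Pool

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

def parse_1():
    url = 'https://www.baidu.com'
    # map blocks until all 300 tasks are done
    with Pool(processes=5) as pool:
        pool.map(parse_2, [url] * 300)

if __name__ == '__main__':
    parse_1()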
You can see that the multiprocessing library's code is a bit more verbose, but it supports richer extension. Multiprocessing and multithreading can both achieve acceleration, but when IO blocking occurs, threads or processes are wasted, so there is a better way...
V. Asynchronous non-blocking
Coroutines combined with callbacks and cooperative task switching can achieve the goal of asynchronous non-blocking. In essence only one thread is used throughout, so resources are utilized to a great extent.
The classic way to implement asynchronous non-blocking is the asyncio library plus yield; to make it easier to use, higher-level wrappers such as aiohttp appeared later, and they are best understood after first learning asyncio (a rough sketch of that style is shown at the end of this section). gevent is a very convenient library for implementing coroutines.
import requests
> from gevent import monkey
# The monkey patch is the soul of running coroutines
> monkey.patch_all()
> import gevent

def parse_1():
    url = 'https://www.baidu.com'
    # Create a task list
    > tasks_list = []
    for i in range(300):
        > task = gevent.spawn(parse_2, url)
        > tasks_list.append(task)
    > gevent.joinall(tasks_list)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
gevent can speed things up considerably, but it also introduces a new problem: what if we don't want to put too much load on the server? With the pool-based multiprocessing and multithreading approaches we can control the pool size; for gevent, a good way to control the speed is to set up a queue, and gevent provides a Queue class for this. The code below changes quite a bit.
import requests
from gevent import monkey
monkey.patch_all()
import gevent
> from gevent.queue import Queue

# Instantiate the queue at module level so that parse_2 can see it
> queue = Queue()

def parse_1():
    url = 'https://www.baidu.com'
    tasks_list = []
    for i in range(300):
        # Push every url into the queue
        > queue.put_nowait(url)
    # Spawn two worker coroutines that consume the queue
    > for _ in range(2):
        > task = gevent.spawn(parse_2)
        > tasks_list.append(task)
    gevent.joinall(tasks_list)

# No parameters need to be passed in; everything is in the queue
> def parse_2():
    # Loop as long as the queue is not empty
    > while not queue.empty():
        # Pop a url from the queue
        > url = queue.get_nowait()
        response = requests.get(url)
        # Print the remaining queue size together with the status code
        > print(queue.qsize(), response.status_code)

if __name__ == '__main__':
    parse_1()
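For reference, the asyncio + aiohttp style mentioned at the start of this section would look roughly like the following. This is a generic sketch rather than code from the original article, and it assumes aiohttp is installed:

import asyncio
import aiohttp

async def parse_2(session, url):
    # The request yields control to the event loop instead of blocking
    async with session.get(url) as response:
        print(response.status)

async def parse_1():
    url = 'https://www.baidu.com'
    # Reuse one session for all requests
    async with aiohttp.ClientSession() as session:
        tasks = [parse_2(session, url) for _ in range(300)]
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(parse_1())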
Concluding remarks
These are several commonly used acceleration methods. If you are interested in testing the code, you can use the time module to measure the running time, as in the timing snippet shown earlier. Crawler acceleration is an important skill, but proper speed control is also a good habit for crawler developers. Don't put too much pressure on the server. Bye
Original release time: April 7, 2020
Author: Chen Xi
This article comes from "Get up early Python"; you can follow "Get up early Python" for more.