Code walkthrough: Python multithreading, multiprocessing, and coroutines

Keywords: Python Programming

I. Preface

Often, after we finish writing a crawler and the requirements are met, we find plenty of room for improvement, and one of the most important points is crawl speed. This article explains, through code, how to use multiprocessing, multithreading, and coroutines to speed up crawling. Note: we will not go into theory and principles; everything is in the code.

II. Synchronization

First of all, we write a simplified crawler, break each piece of functionality into its own function, and deliberately program in a functional style. The purpose of the following code is to request the Baidu homepage 300 times and print the status code; the parse_1 function controls the number of loops, and each iteration passes the url to the parse_2 function.

import requests 
 
def parse_1(): 
    url = 'https://www.baidu.com' 
    for i in range(300): 
        parse_2(url) 
 
def parse_2(url): 
    response = requests.get(url) 
    print(response.status_code) 
 
if __name__ == '__main__': 
    parse_1()

The performance cost lies mainly in the IO of the requests: when URLs are requested in a single-process, single-thread way, waiting is inevitable.

The sample code is typical serial logic: parse_1 passes the url to parse_2, parse_2 makes the request and prints the status code, and parse_1 then moves on to the next iteration, repeating the previous steps.

III. Multithreading

Because the CPU executes only one thread in any given time slice, multithreading raises the utilization of the process and, in turn, of the CPU: while one thread waits on IO, another can run.

There are many libraries that implement multithreading. Here we use ThreadPoolExecutor from concurrent.futures to demonstrate, because its code is simpler than that of the other libraries.

To make the explanation easier to follow, a > symbol is added in front of every newly added line of code below; remove it before actually running the code.

import requests 
> from concurrent.futures import ThreadPoolExecutor 
 
def parse_1(): 
    url = 'https://www.baidu.com' 
    # Set up thread pool 
    > pool = ThreadPoolExecutor(6) 
    for i in range(300): 
        > pool.submit(parse_2, url) 
    > pool.shutdown(wait=True) 
 
def parse_2(url): 
    response = requests.get(url) 
    print(response.status_code) 
 
if __name__ == '__main__': 
    parse_1()

The opposite of synchronous is asynchronous. Asynchronous means the tasks are independent of one another: while waiting for one event, we carry on with our own work instead of waiting for that event to finish first. Threads are one way to achieve asynchrony. That also means that with multithreading we do not immediately know the result of each task; when we do need the result, we can use a callback.

import requests 
from concurrent.futures import ThreadPoolExecutor 
 
# Add callback function 
> def callback(future): 
    > print(future.result()) 
 
def parse_1(): 
    url = 'https://www.baidu.com' 
    pool = ThreadPoolExecutor(6) 
    for i in range(300): 
        > results = pool.submit(parse_2, url) 
        # The key step: register the callback 
        > results.add_done_callback(callback) 
    pool.shutdown(wait=True) 
 
def parse_2(url): 
    response = requests.get(url) 
    print(response.status_code) 
 
if __name__ == '__main__': 
    parse_1()

Python's multithreading is much criticized for the GIL (Global Interpreter Lock), but multithreading is still very well suited to crawling web pages, which are mostly IO-intensive tasks.
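
To make that concrete, here is a rough sketch (not from the original article; cpu_task and io_task are made-up stand-ins) showing that threads barely help pure computation because of the GIL, while IO-style waits overlap nicely because the GIL is released during them:

import time 
from concurrent.futures import ThreadPoolExecutor 

def cpu_task(): 
    # Pure computation: the GIL is held, so threads mostly take turns 
    sum(i * i for i in range(10**6)) 

def io_task(): 
    # Stand-in for waiting on a network response: the GIL is released while sleeping 
    time.sleep(0.5) 

for task in (cpu_task, io_task): 
    start = time.perf_counter() 
    # The with-block shuts the pool down and waits for all submitted tasks 
    with ThreadPoolExecutor(4) as pool: 
        for _ in range(4): 
            pool.submit(task) 
    print(task.__name__, round(time.perf_counter() - start, 2), 'seconds with 4 threads')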

IV. Multiprocessing

Multiprocessing is implemented in two ways: ProcessPoolExecutor and multiprocessing

1. ProcessPoolExecutor

Its usage is similar to the ThreadPoolExecutor that implements multithreading

import requests 
> from concurrent.futures import ProcessPoolExecutor 
 
def parse_1(): 
    url = 'https://www.baidu.com' 
    # Set up process pool 
    > pool = ProcessPoolExecutor(6) 
    for i in range(300): 
        > pool.submit(parse_2, url) 
    > pool.shutdown(wait=True) 
 
def parse_2(url): 
    response = requests.get(url) 
    print(response.status_code) 
 
if __name__ == '__main__': 
    parse_1()

You can see that only the class name has changed, in two places, and the code is still very simple. Likewise, you can add a callback function.

import requests 
from concurrent.futures import ProcessPoolExecutor 
 
> def callback(future): 
    > print(future.result()) 
 
def parse_1(): 
    url = 'https://www.baidu.com' 
    pool = ProcessPoolExecutor(6) 
    for i in range(300): 
        > results = pool.submit(parse_2, url) 
        > results.add_done_callback(callback) 
    pool.shutdown(wait=True) 
 
def parse_2(url): 
    response = requests.get(url) 
    print(response.status_code) 
 
if __name__ == '__main__': 
    parse_1()

2. multiprocessing

Look directly at the code. Everything is in the comments.

import requests 
> from multiprocessing import Pool 
 
def parse_1(): 
    url = 'https://www.baidu.com' 
    # Build the process pool 
    > pool = Pool(processes=5) 
    # List for the result objects 
    > res_lst = [] 
    for i in range(300): 
        # Add tasks to the pool 
        > res = pool.apply_async(func=parse_2, args=(url,)) 
        # Keep the AsyncResult; its value must be fetched later with get() 
        > res_lst.append(res) 
    # Collect the final results (you could also store or print them directly) 
    > good_res_lst = [] 
    > for res in res_lst: 
        # get() waits for the task to finish and returns its result 
        > good_res = res.get() 
        # Keep only non-empty results 
        > if good_res: 
            > good_res_lst.append(good_res) 
    # Close the pool and wait for all work to finish 
    > pool.close() 
    > pool.join() 
 
def parse_2(url): 
    response = requests.get(url) 
    print(response.status_code) 
 
if __name__ == '__main__': 
    parse_1()

You can see that the multiprocessing library's code is a bit more verbose, but it supports richer extensions. Multiprocessing and multithreading can both achieve the goal of acceleration, but when IO blocking occurs, threads or processes sit idle and are wasted, so there is a better way.

V. Asynchronous non-blocking

Coroutines plus callbacks, together with cooperative switching between tasks, can achieve the goal of asynchronous non-blocking execution. In essence only one thread is used, so resources are exploited to the greatest extent.

The classic way to implement asynchronous non-blocking is the asyncio library plus yield; it was later wrapped by the higher-level aiohttp, which is easier to pick up once you understand asyncio. A minimal sketch of that approach is shown first, and then we turn to gevent, a very convenient coroutine library that the rest of this section uses.
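
Purely for illustration, here is a minimal sketch of that asyncio + aiohttp approach (it is not part of the original walkthrough and assumes aiohttp is installed; the parse_1/parse_2 names simply mirror the earlier examples):

import asyncio 
import aiohttp 

async def parse_2(session, url): 
    # One request per coroutine; await suspends instead of blocking 
    async with session.get(url) as response: 
        print(response.status) 

async def parse_1(): 
    url = 'https://www.baidu.com' 
    # Reuse a single session for all 300 requests 
    async with aiohttp.ClientSession() as session: 
        tasks = [parse_2(session, url) for _ in range(300)] 
        await asyncio.gather(*tasks) 

if __name__ == '__main__': 
    asyncio.run(parse_1())

The same crawler written with gevent looks like this: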

import requests 
> from gevent import monkey 
# The monkey patch is the soul of making the coroutines work: it turns blocking IO into cooperative IO 
> monkey.patch_all() 
> import gevent 
 
def parse_1(): 
    url = 'https://www.baidu.com' 
    # Create task list 
    > tasks_list = [] 
    for i in range(300): 
        > task = gevent.spawn(parse_2, url) 
        > tasks_list.append(task) 
    > gevent.joinall(tasks_list) 
 
def parse_2(url): 
    response = requests.get(url) 
    print(response.status_code) 
 
if __name__ == '__main__': 
    parse_1()

gevent can speed things up a great deal, but it also introduces a new problem: what if we don't want to put too much load on the server? With the pool-based multiprocessing and multithreading approaches we can control the pool size; for gevent, a good way to control the speed is to set up a queue. gevent provides a Queue class for this. The code below changes quite a bit.

import requests 
from gevent import monkey 
monkey.patch_all() 
import gevent 
> from gevent.queue import Queue 
 
def parse_1(): 
    url = 'https://www.baidu.com' 
    tasks_list = [] 
    # Instantiate the queue at module level so that parse_2 can also see it 
    > global queue 
    > queue = Queue() 
    for i in range(300): 
        # Push every URL into the queue 
        > queue.put_nowait(url) 
    # Spawn two coroutines to consume the queue 
    > for _ in range(2): 
        > task = gevent.spawn(parse_2) 
        > tasks_list.append(task) 
    gevent.joinall(tasks_list) 
 
# No parameters needed; everything comes from the queue 
> def parse_2(): 
    # Loop until the queue is empty 
    > while not queue.empty(): 
        # Pop a url from the queue 
        > url = queue.get_nowait() 
        response = requests.get(url) 
        # Print the remaining queue size together with the status code 
        > print(queue.qsize(), response.status_code) 
 
if __name__ == '__main__': 
    parse_1()

Concluding remarks

These are several commonly used acceleration methods. If you are interested in testing them, you can use the time module to measure the running time. Crawler acceleration is an important skill, but proper speed control is also a good habit for crawler writers, so don't put too much pressure on the server. Bye~
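
For example, a minimal sketch of such a timing check, wrapping any of the parse_1 variants above, could look like this:

import time 

if __name__ == '__main__': 
    start = time.time() 
    parse_1() 
    # Report how long the whole crawl took 
    print('Total time:', round(time.time() - start, 2), 'seconds')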

Original release time: April 7, 2020
Author: Chen Xi
This article comes from "Get up early Python"; you can follow "Get up early Python" for more.
