Note: Chapter 9 of High Performance Python

Keywords: Python socket less REST

This chapter covers parallel computing and runs to more than 60 pages. These notes sort out and digest its main points.

1. Estimating pi with Monte Carlo simulation

The logic is simple: throw darts at the unit square of the coordinate system, count the proportion that land inside the quarter unit circle (x^2 + y^2 <= 1), and multiply by 4. Running sequentially with 100,000,000 throws takes about 120 seconds.
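A minimal sequential sketch of the estimate (the function name and the small throw count here are mine; the book uses 1e8 throws):

```python
import random

def estimate_pi(n):
    # Count darts landing inside the quarter circle x^2 + y^2 <= 1
    hits = 0
    for _ in range(n):
        x, y = random.uniform(0, 1), random.uniform(0, 1)
        if x * x + y * y <= 1.0:
            hits += 1
    # The quarter circle covers pi/4 of the unit square
    return 4 * hits / n

print(estimate_pi(100_000))  # roughly 3.14
```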

1.1 Multiprocessing acceleration

Each throw is independent of the others, so the workload can be split directly into several shares (say, 8) and handed to multiple processes. The function each process runs needs no shared state with the rest of the program, and only a small amount of data has to cross process boundaries to accomplish a large amount of computation.
The function (called estimate_nbr_points_in_quarter_circle in the book) counts how many darts land inside the quarter unit circle after repeated throws:

import random

def estimate_points(nbr_estimates):
    in_unit_circle = 0
    for step in range(int(nbr_estimates)):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        is_in_unit_circle = x * x + y * y <= 1.0
        in_unit_circle += is_in_unit_circle  # True counts as 1
    return in_unit_circle

Parallel part:

import time
from multiprocessing import Pool
...
samples_total = 1e8
N = 8
pool = Pool(processes=N)
samples = samples_total / N
trials = [samples] * N

t1 = time.time()
nbr_in_unit_circles = pool.map(estimate_points, trials)
pi_estimate = sum(nbr_in_unit_circles) * 4 / samples_total

print("Estimate pi", pi_estimate)
print("Delta:", time.time() - t1)

This is a fairly simple parallel program: decide how many processes to use (usually the number of CPUs), set each process's argument in the trials list, and then pool.map works just like the ordinary map function. The run took about 19 seconds.

(An aside: the book's version of this code is hard to read because the variable names are so long -- nbr_trials_in_quarter_unit_circle, nbr_trials_per_process, nbr_samples_in_total everywhere. Variable names made of three or four words turn a very simple thing into a long, dizzying passage. The moral: keep code concise and information density high.)

Item 41 of Effective Python points out that multiprocessing is costly because data must be serialized and deserialized between the parent process and the child processes. Specifically, multiprocessing does the following:

  1. Take each item of data in the trials list passed to map.
  2. Serialize it into binary form with the pickle module.
  3. Send the serialized data over a local socket from the main interpreter's process to a child interpreter's process.
  4. In the child process, deserialize the bytes back into a Python object with pickle.
  5. Import the Python module containing the estimate_points function.
  6. Each child process runs estimate_points in parallel on its input data.
  7. Serialize the result back to bytes.
  8. Copy those bytes back to the parent process through the socket.
  9. The parent process deserializes the bytes back into Python objects.
  10. Merge each child's results into a single list and return it to the caller.
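Steps 2, 4, 7 and 9 above are plain pickle round trips. A rough illustration of what multiprocessing does per argument (a sketch only -- the real transport over the local socket in step 3 is handled internally):

```python
import pickle

trials = [12.5e6] * 8  # the per-process workloads from the example above

# Step 2: the parent serializes each argument to bytes...
payloads = [pickle.dumps(t) for t in trials]
# Step 3 would ship these bytes over a local socket to the children.

# Step 4: each child deserializes its argument back into a Python object.
restored = [pickle.loads(p) for p in payloads]
assert restored == trials
```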

1.2 Random numbers in parallel systems

In parallel computing we have to ask whether workers can end up with repeated or correlated random sequences. With Python's built-in random module, multiprocessing resets the generator's seed in each forked worker for us. With numpy, however, we must reseed ourselves, or every worker will return the identical sequence.
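The problem is easy to see without forking at all: two NumPy generators that start from the same state (as forked workers would, absent reseeding) emit identical streams. A small sketch (the per-worker default_rng seeding shown is one common fix, not the book's code):

```python
import numpy as np

# Two "workers" that inherit the same RNG state, as fork would copy it:
a = np.random.RandomState(42).uniform(0, 1, 5)
b = np.random.RandomState(42).uniform(0, 1, 5)
assert (a == b).all()  # identical "random" samples -- correlated results

# Giving each worker a distinct seed breaks the tie; the modern idiom
# is one independent Generator per worker:
rngs = [np.random.default_rng(seed) for seed in range(8)]
c, d = rngs[0].uniform(0, 1, 5), rngs[1].uniform(0, 1, 5)
assert not (c == d).all()
```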

Using numpy:

import numpy as np

def estimate_points(samples):
    np.random.seed()  # reseed in each worker; numpy does not do this on fork
    xs = np.random.uniform(0, 1, int(samples))
    ys = np.random.uniform(0, 1, int(samples))
    is_in_quc = (xs * xs + ys * ys) <= 1.0
    in_quc = np.sum(is_in_quc)
    return in_quc

Using numpy reduces the running time to 1.13 seconds. NumPy is impressively fast (or rather, pure CPython is just that slow).

2. Finding prime numbers

Finding all primes in a large range is a different problem from estimating pi. First, the amount of work depends on where the range lies (checking [10, 100] costs far less than [10000, 100000]). Second, each individual number costs a different amount to check: a number with a small factor is rejected almost immediately, while a prime must be trial-divided all the way up to its square root, so even numbers are the cheapest to check and primes the most expensive. No state needs to be shared here; the key question is how to balance the workload across processes, distributing tasks of varying cost over limited computing resources.
When we hand computing tasks to a process pool, we can control how much work each process receives at a time: divide the work into chunks, and give a chunk to a CPU as soon as it is free. The larger the chunk, the lower the communication overhead; the smaller the chunk, the finer the control. A chunk size of 10 means a process checks 10 numbers per round trip. The author presents a chunk-count-versus-runtime graph illustrating a simple principle: runtime is shortest when the number of chunks is a multiple of the number of CPUs (otherwise some CPUs sit idle during the last round of computation).
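Pool.map exposes exactly this knob through its chunksize argument. A sketch, assuming a stand-in check_number instead of the real primality test:

```python
from multiprocessing import Pool

def check_number(n):
    # Stand-in for the per-number primality check.
    return n % 2 == 1

def chunks_for(n_items, chunksize):
    # Number of chunks the pool hands out -- ideally a multiple of the CPU count.
    return -(-n_items // chunksize)  # ceiling division

if __name__ == "__main__":
    numbers = list(range(1000))
    with Pool(processes=2) as pool:
        # chunksize=10: each worker grabs 10 numbers per round trip.
        results = pool.map(check_number, numbers, chunksize=10)
    print(sum(results))  # 500 odd numbers below 1000
```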

We can use queues to provide tasks to a group of worker processes and collect results:

import math
import time
from multiprocessing import Process, Queue

ALL_DONE = b"ALL_DONE"
WORKER_FINISHED = b"WORKER_FINISHED"

def check_prime(possibles, definites):
    while True:
        n = possibles.get()
        if n == ALL_DONE:
            definites.put(WORKER_FINISHED)
            break
        else:
            if n % 2 == 0:
                continue
            for i in range(3, int(math.sqrt(n)) + 1, 2):
                if n % i == 0:
                    break
            else:  # no divisor found: n is prime
                definites.put(n)

possibles and definites are the input and output queues. We define two flags: ALL_DONE is the sentinel that terminates the loop; the parent puts it on possibles after all the numbers, to tell each child there is no more work. When a child receives ALL_DONE, it puts WORKER_FINISHED on definites to tell the parent it has seen the sentinel, and then stops reading from the possibles queue.

Create the I/O queues and 8 processes, feed the numbers into the possibles queue, and finally add 8 ALL_DONE sentinels:

if __name__ == '__main__':
    primes = []
    possibles = Queue()
    definites = Queue()

    N = 8
    processes = []
    for _ in range(N):
        p = Process(target=check_prime, args=(possibles, definites))
        processes.append(p)
        p.start()
    
    t1 = time.time()
    
    number_range = range(10000000000, 10000100000)
    for possible in number_range:
        possibles.put(possible)
    print("ALL JOBS ADDED TO THE QUEUE")

    # add poison pills to stop the remote workers
    for n in range(N):
        possibles.put(ALL_DONE)
    print("NOW WAITING FOR RESULTS...")
    ...

Loop over the definites queue collecting results (which arrive in no particular order, of course) and stop once 8 WORKER_FINISHED flags have been received:

    ...
    processors_finished = 0
    while True:
        new_result = definites.get()
        if new_result == WORKER_FINISHED:
            processors_finished += 1
            print("{} WORKER(S) FINISHED".format(processors_finished))
            if processors_finished == N:
                break
        else:
            primes.append(new_result)
    assert processors_finished == N

    print("Took:", time.time() - t1)
    print(len(primes), primes[:10], primes[-10:])
    

The program took a little over seven seconds, versus about 20 seconds sequentially. But because queues require serialization and synchronization, multiprocess execution is not necessarily faster than sequential execution. In the book, even after the author removes all even numbers from the input queue, the multiprocess version is still slower than the sequential one, showing that a large share of the multiprocess runtime goes to communication overhead.
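One way to shrink that communication overhead, given these numbers, is to put chunks of numbers on the queue instead of single numbers, so each put/get amortizes the pickling cost over many items. A sketch of the idea (chunk size 100 is arbitrary; this is not the book's code):

```python
def chunked(seq, size):
    # Split the work into lists of `size` items so each queue put/get
    # pays the serialization cost once per chunk, not once per number.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

number_range = list(range(10_000_000_000, 10_000_001_000))
chunks = list(chunked(number_range, 100))
assert len(chunks) == 10 and len(chunks[0]) == 100
# The parent would then possibles.put(chunk) for each chunk,
# and each worker would loop over the numbers in the chunk it receives.
```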

3. Verifying primality

Unlike section 2's "find all primes in a range", here we tackle how to decide quickly whether one particularly large number (say, an 18-digit one) is prime, with multiple CPUs cooperating. This is a problem that calls for inter-process communication or shared state.

3.1 A simple process pool

As in the first two examples, we split the candidate factors of the number under test into several groups and hand them to several child processes to check. When one child finds a factor that divides the number evenly, it returns False -- but this does not stop the other children (hence the "simple" version). That can leave the other children doing useless work, but it also avoids the communication overhead of checking shared state.
Grouping the factors:

def create_range(from_i, to_i, N):
    piece_length = int((to_i - from_i) / N)
    # Bump even boundaries to odd so each range starts on an odd number
    lrs = [from_i] + [(i + 1) if (i % 2 == 0) else i
                      for i in range(from_i, to_i, piece_length)[1:]]
    if len(lrs) > N:
        lrs.pop()
    assert len(lrs) == N
    ranges = list(zip(lrs, lrs[1:])) + [(lrs[-1], to_i)]
    return ranges

e.g. create_range(1000, 100000, 4) returns [(1000, 25751), (25751, 50501), (50501, 75251), (75251, 100000)].

import time
import math
from multiprocessing import Pool


def check_prime_in_range(args):
    # Pool.map passes a single argument, so the three values are packed
    # into a tuple and unpacked here
    n, from_i, to_i = args
    if n % 2 == 0:
        return False
    for i in range(from_i, to_i, 2):
        if n % i == 0:
            return False
    return True


def check_prime(n, pool, N):
    from_i = 3
    to_i = int(math.sqrt(n)) + 1
    ranges = create_range(from_i, to_i, N)
    args = [(n, from_i, to_i) for from_i, to_i in ranges]
    results = pool.map(check_prime_in_range, args)
    if False in results:
        return False
    return True


if __name__ == "__main__":
    N = 8
    pool = Pool(processes=N)
    prime18 = 100109100129100151
    t1 = time.time()
    print("%d: %s" %(prime18, check_prime(prime18, pool, N)))
    print('Took:', time.time() - t1)

It took about 10 seconds.

3.2 A slightly less simple process pool

Because of the extra overhead, the multiprocess approach can lose to a sequential search for smaller numbers. Moreover, if a small factor exists, the program above does not stop immediately once it is found. We could notify the other processes the moment a factor turns up, but since most numbers have a small factor, that would generate a lot of extra communication. So a hybrid strategy is used: first search the small factors sequentially, then hand the rest of the work to multiple processes. This is a common way to avoid multiprocess overhead.

def check_prime(n, pool, N):
    # Check the small factors sequentially first
    from_i = 3
    to_i = 21
    args = (n, from_i, to_i)
    if not check_prime_in_range(args):
        return False
        
    from_i = to_i
    to_i = int(math.sqrt(n)) + 1
    ranges = create_range(from_i, to_i, N)
    args = [(n, from_i, to_i) for from_i, to_i in ranges]
    results = pool.map(check_prime_in_range, args)
    if False in results:
        return False
    return True

3.3 Using multiprocessing.Manager() as a flag

Straight to the code. A flag is created with the Manager; reading it requires no locking or other ceremony -- it is as convenient as checking a global variable (though it is still passed into the function as a parameter). To save communication cost, each process checks the flag only once every 1000 iterations; if it sees FLAG_SET, or finds a factor itself, it stops.

import time
import math
from multiprocessing import Pool, Manager


SERIAL_CHECK_CUTOFF = 21
CHECK_EVERY = 1000
FLAG_CLEAR = b'0'
FLAG_SET = b'1'


def create_range(from_i, to_i, N):
    piece_length = int((to_i - from_i) / N)
    lrs = [from_i] + [(i + 1) if (i % 2 == 0) else i
                      for i in range(from_i, to_i, piece_length)[1:]]
    if len(lrs) > N:
        lrs.pop()
    assert len(lrs) == N
    ranges = list(zip(lrs, lrs[1:])) + [(lrs[-1], to_i)]
    return ranges

def check_prime_in_range(args):
    n, from_i, to_i, value = args
    if n % 2 == 0:
        return False
    check_every = CHECK_EVERY
    for i in range(from_i, to_i, 2):
        check_every -= 1
        if not check_every:
            if value.value == FLAG_SET:
                return False
            check_every = CHECK_EVERY

        if n % i == 0:
            value.value = FLAG_SET
            return False
    return True

def check_prime(n, pool, N, value):
    from_i = 3
    to_i = SERIAL_CHECK_CUTOFF
    value.value = FLAG_CLEAR  # remember to reset the flag before each check
    args = (n, from_i, to_i, value)
    if not check_prime_in_range(args):
        return False

    from_i = to_i
    to_i = int(math.sqrt(n)) + 1
    ranges = create_range(from_i, to_i, N)
    args = [(n, from_i, to_i, value) for from_i, to_i in ranges]
    results = pool.map(check_prime_in_range, args)
    if False in results:
        return False
    return True


if __name__ == "__main__":
    N = 8
    manager = Manager()
    value = manager.Value(b'c', FLAG_CLEAR)  # a shared one-byte (one-character) flag
    pool = Pool(processes=N)
    prime18 = 100109100129100151
    non_prime = 100109100129101027
    t1 = time.time()
    print("%d: %s" %(non_prime, check_prime(non_prime, pool, N, value)))
    print('Took:', time.time()-t1)

Posted by dunnsearch on Thu, 11 Jun 2020 20:43:15 -0700