Parallelism in one line of Python: a slick skill worth picking up!

Python has something of a bad reputation when it comes to parallelizing programs. Technical issues such as the thread implementation and the GIL aside, I think misguided instructional material is the bigger problem. The classic Python multithreading and multiprocessing tutorials tend to be "heavy": they scratch the itch through the boot, never getting to what is actually most useful in day-to-day work.

Traditional examples

A quick search for "Python multithreading tutorial" turns up example after example built around classes and queues:

import os

from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()

Ha, it looks like Java, doesn't it?

I'm not saying it's wrong to use the producer/consumer model for multithreaded/multiprocess tasks (in fact, this model has its uses). It's just that for everyday scripting tasks we can use a more efficient model.

The problem is that...

First, you need a boilerplate class;
Second, you need a queue to pass objects through;
And you need to build methods on both ends of the channel to make it all work (plus yet another queue if you want two-way communication or need to collect results).

The more workers, the more problems

With this in mind, you now need a pool of worker threads. Here's an example from a classic IBM tutorial that uses multithreading to speed up web-page retrieval.

# Example2.py
'''
A more realistic thread pool example
'''

import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            # 'quit' is the poison pill that tells this worker to exit
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'

def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com',
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add the poison pill, one per worker
    for worker in worker_threads:
        queue.put('quit')
    # Wait for all workers to finish
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()

This code works, but look closely at everything we had to do: write different methods, keep track of a series of threads, and, to avoid the dreaded deadlock, perform a series of join operations. And this is just the beginning...

So far we've reviewed the classic multithreading tutorials. Don't they feel somewhat hollow? Boilerplate-heavy and error-prone, this much-effort-for-little-gain style is clearly unsuited to everyday use. Fortunately, there is a better way.

Why not try map

map is a compact and elegant little function, and it is the key to parallelizing Python code with ease. map comes from functional languages such as Lisp. It applies a function over a sequence, mapping each element to a result.

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)

These two lines pass each element of the urls sequence to the urlopen method and collect all the results in the results list. It is roughly equivalent to:

results = []
for url in urls:
    results.append(urllib2.urlopen(url))

The map function takes care of the drudgery for us: iterating over the sequence, passing each element to the function, and storing the results.

Why does this matter? Because with the right libraries, map is easily parallelized.

Two libraries in Python provide a parallelized map function: multiprocessing, and its little-known sub-module, multiprocessing.dummy.

A quick aside here: multiprocessing.dummy? A thread-based clone of the multiprocessing library? Is that even a thing? Even the official documentation of the multiprocessing library has only a single line about this sub-module, and it essentially boils down to "well, now you know it exists." Trust me: this library is severely underestimated!

multiprocessing.dummy is a complete clone of the multiprocessing module. The only difference is that multiprocessing works with processes, while the dummy module works with threads (and therefore comes with all of Python's usual multithreading limitations).
Swapping one library for the other is therefore trivial, and you can pick whichever fits the task: threads for IO-intensive work, processes for CPU-intensive work.
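
To see just how interchangeable the two are, here's a minimal sketch of my own (square and the numbers are placeholders, not from the examples in this post); the only line that changes between the thread version and the process version is the import:

from multiprocessing.dummy import Pool as ThreadPool
# For the process version, swap the line above for:
# from multiprocessing import Pool as ThreadPool

def square(x):  # placeholder CPU work, just for illustration
    return x * x

if __name__ == '__main__':
    pool = ThreadPool(4)
    print pool.map(square, range(10))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    pool.close()
    pool.join()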

Try it by hand

Use the following two lines of code to reference the library that contains the parallelized map function:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

Instantiate the Pool object:

pool = ThreadPool()

This one simple statement replaces the seven lines of the build_worker_pool function in example2.py. It spawns a set of worker threads, initializes them, and stores them in a variable for easy access.

The Pool object takes a handful of parameters, but the only one we need to focus on here is the first: processes. It sets the number of workers in the pool, and it defaults to the number of CPU cores on the current machine.
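
As a quick sketch of that default (the explicit cpu_count() call here is my own illustration, not from the original example):

import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool

n = multiprocessing.cpu_count()   # the default pool size
pool = ThreadPool(n)              # same as ThreadPool() on this machine
pool.close()
pool.join()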

In general, for CPU-intensive tasks, the more cores you can throw at the problem, the faster it runs. But for network-bound tasks things are less predictable, and it is wise to experiment to find the right pool size.

pool = ThreadPool(4) # Sets the pool size to 4

With too many threads, the time spent switching between them can even exceed the time spent on actual work. So for each kind of job, it's worth experimenting to find the optimal pool size.
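
One simple way to run that experiment, sketched here with a placeholder urls list of your own choosing, is to time pool.map at a few candidate sizes:

import time
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = ['http://www.python.org', 'http://www.yahoo.com']  # your own list here

for size in (1, 2, 4, 8, 13):
    pool = ThreadPool(size)
    start = time.time()
    pool.map(urllib2.urlopen, urls)  # fetch every url with `size` threads
    pool.close()
    pool.join()
    print 'Pool of %2d: %4.1f seconds' % (size, time.time() - start)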

Once the Pool object has been created, the parallelized program is on its way. Let's take a look at the rewritten example2.py:

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]

# Make the Pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
# Close the pool and wait for the work to finish
pool.close()
pool.join()

Only four lines of code actually do any work, and only one of them is critical. The map function effortlessly replaces the 40-plus-line example above. To make things more interesting, I also timed different methods and different pool sizes.

# ------- Single thread ------- #
# results = []
# for url in urls:
#     result = urllib2.urlopen(url)
#     results.append(result)

# # ------- VERSUS ------- #

# ------- 4 Pool ------- #
# pool = ThreadPool(4)
# results = pool.map(urllib2.urlopen, urls)

# ------- 8 Pool ------- #
# pool = ThreadPool(8)
# results = pool.map(urllib2.urlopen, urls)

# ------- 13 Pool ------- #
# pool = ThreadPool(13)
# results = pool.map(urllib2.urlopen, urls)

Results:

# Single thread: 14.4 seconds
# 4 Pool:         3.1 seconds
# 8 Pool:         1.4 seconds
# 13 Pool:        1.3 seconds

Pretty great results, no? They also show why it pays to determine the pool size experimentally: on my machine, pools larger than 9 threads brought only marginal gains.

Another real example

Generating thumbnails for thousands of images.

This is a CPU-intensive task, and it is well suited to parallelization.

The basic single-process version:

import os

from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    for image in images:
        create_thumbnail(image)

The code above iterates over the image files in the given folder, generates a thumbnail for each one, and saves the thumbnails to a dedicated subfolder.

This program takes 27.9 seconds to process 6,000 pictures on my machine.

If we use the map function instead of the for loop:
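
Concretely, the for loop gives way to a Pool, exactly as in the version shown at the top of this post:

from multiprocessing import Pool

# ...get_image_paths and create_thumbnail unchanged...

pool = Pool()                        # one worker process per CPU core
pool.map(create_thumbnail, images)   # replaces the for loop
pool.close()
pool.join()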

5.6 seconds!

Although we changed only a few lines of code, we improved the program's execution speed dramatically. In a production environment you can go further by using the multiprocessing (process) library for CPU-intensive tasks and the multiprocessing.dummy (thread) library for IO-intensive tasks; this also neatly sidesteps deadlock problems. And since the map-based approach involves no manual thread management, debugging stays very simple.
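
To sum up that rule of thumb in code, here's a sketch pairing each library with the workload it suits (it reuses create_thumbnail, images, and urls from the examples above; the pool size of 4 is arbitrary):

import urllib2
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

# CPU-bound work: use processes, which sidestep the GIL
pool = Pool(4)
pool.map(create_thumbnail, images)
pool.close()
pool.join()

# IO-bound work: use threads, which are cheap while waiting on the network
tpool = ThreadPool(4)
tpool.map(urllib2.urlopen, urls)
tpool.close()
tpool.join()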

And there it is: parallelism, (basically) in one line of Python.
