I recently picked up a crawler project and started studying Python crawler development in depth. This blog doubles as my learning notes. I think the first step in learning to write crawlers is to learn Python multithreading and multiprocessing and to get familiar with network programming, so I am sharing those notes here in blog form.
Multiprocessing
There are two main ways to create multiple processes in Python: one is the fork method in the os module, the other is the multiprocessing module. The difference is that the former only works on Unix/Linux and does not support Windows, while the latter is a cross-platform implementation. At present, most crawlers run on Unix/Linux.
1. Creating processes with the fork method of the os module
The fork method is called once and returns twice: the operating system copies the current process (the parent process) into a nearly identical child process. In the child process fork always returns 0, and in the parent process it returns the ID of the child process.
import os

# getpid() gets the ID of the current process and getppid() gets the ID of the parent process.
if __name__ == '__main__':
    print('The current process is (%s)' % os.getpid())
    try:
        pid = os.fork()
    except OSError:
        # In Python a failed fork raises OSError rather than returning a negative value
        print('error in fork')
    else:
        if pid == 0:
            print('I am a child process (%s), my parent process is (%s)' % (os.getpid(), os.getppid()))
        else:
            print('I (%s) created a child process (%s).' % (os.getpid(), pid))
2. Using the multiprocessing module to create multiple processes
The multiprocessing module provides a Process class to describe a process object. To create a child process, you only need to pass the target function and its arguments to create a Process instance. Start the process with the start() method and wait for it to finish with the join() method.
import os
from multiprocessing import Process


def run_proc(name):
    print('Child process %s (%s) Running...' % (name, os.getpid()))


if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    processes = []
    for i in range(5):
        p = Process(target=run_proc, args=(str(i),))
        print('Process will start.')
        p.start()
        processes.append(p)
    # Wait for every child process, not only the last one started
    for p in processes:
        p.join()
    print('Process end.')
3. Using the Pool class of the multiprocessing module to create a process pool
A Pool provides a specified number of worker processes for the user to call on; the default size is the number of CPU cores. When a new task is submitted to a Pool, an idle worker process picks it up. If every process in the pool is busy, the task waits until one of them becomes free, and no extra process is created to handle it. Here is an example to illustrate the workflow of the process pool. The code is as follows:
import os, time, random
from multiprocessing import Pool


def run_task(name):
    print('Task %s (pid = %s) is running...' % (name, os.getpid()))
    time.sleep(random.random() * 3)
    print('Task %s is end.' % name)


if __name__ == '__main__':
    print('Current process is %s.' % os.getpid())
    p = Pool(processes=3)
    for i in range(5):
        p.apply_async(run_task, args=(i,))
    print('waiting for all subprocesses done...')
    p.close()
    p.join()
    print('All subprocesses done.')
Calling join() on the Pool object waits for all the worker processes to finish. close() must be called before join(). After close() is called, no new tasks can be submitted to the pool.
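The example above throws away whatever run_task returns. As a rough sketch of how results could be collected, assuming a hypothetical fetch_length function standing in for real crawling work, apply_async hands back an AsyncResult object whose get() method blocks until that task's return value is ready:

```python
import os
from multiprocessing import Pool


def fetch_length(url):
    # Hypothetical stand-in for real work such as downloading a page.
    return len(url)


if __name__ == '__main__':
    urls = ['url_%d' % i for i in range(5)]
    # For reference: a Pool created without a size uses the CPU count.
    print('CPU count: %s' % os.cpu_count())
    with Pool(processes=3) as pool:
        # apply_async returns immediately with an AsyncResult object.
        pending = [pool.apply_async(fetch_length, args=(url,)) for url in urls]
        # get() blocks until that task's return value is available.
        lengths = [result.get() for result in pending]
    print(lengths)
```

Every result is fetched inside the with block, because leaving the block shuts the pool down.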
4. Interprocess communication
Python provides a variety of ways for processes to communicate, such as Queue, Pipe, Value and Array, and so on.
Pipe is typically used for communication between two processes, and Queue for communication among multiple processes.
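Value and Array are mentioned above but not shown in these notes. The following is only a minimal sketch of shared memory between processes; the names record_visit, counter and slots are made up for illustration:

```python
from multiprocessing import Process, Value, Array


def record_visit(counter, slots, index):
    # get_lock() returns the lock that guards the shared Value,
    # so the read-modify-write below is not interleaved.
    with counter.get_lock():
        counter.value += 1
    # Each worker writes only to its own slot of the shared Array.
    slots[index] = index * index


if __name__ == '__main__':
    counter = Value('i', 0)   # shared C int, starts at 0
    slots = Array('i', 4)     # shared array of 4 C ints, zero-filled
    workers = [Process(target=record_visit, args=(counter, slots, i))
               for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value, slots[:])
```

Unlike Queue and Pipe, nothing is sent anywhere: the child processes modify memory that the parent process can read directly after join().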
1. Queue implementation:
from multiprocessing import Process, Queue
import os, time, random


# Code executed by the writer processes:
def proc_write(q, urls):
    print('Process (%s) is writing...' % os.getpid())
    for url in urls:
        q.put(url)
        print('Put %s to queue...' % url)
        time.sleep(random.random())


# Code executed by the reader process:
def proc_read(q):
    print('Process (%s) is reading...' % os.getpid())
    while True:
        url = q.get(True)
        print('Get %s from queue.' % url)


if __name__ == '__main__':
    # The parent process creates the Queue and passes it to each child process
    q = Queue()
    proc_write1 = Process(target=proc_write, args=(q, ['url_1', 'url_2', 'url_3']))
    proc_write2 = Process(target=proc_write, args=(q, ['url_4', 'url_5', 'url_6']))
    proc_reader = Process(target=proc_read, args=(q,))
    # Start the writer processes:
    proc_write1.start()
    proc_write2.start()
    # Start the reader process:
    proc_reader.start()
    # Wait for the writer processes to end:
    proc_write1.join()
    proc_write2.join()
    # The reader process runs an infinite loop, so it cannot be waited on. It can only be forced to terminate:
    proc_reader.terminate()
2. Pipe implementation:
The Pipe() function returns a pair (conn1, conn2) representing the two ends of a pipe. It takes a duplex parameter. If duplex is True (the default), the pipe is full duplex, which means that both conn1 and conn2 can send and receive. If duplex is False, conn1 can only receive messages and conn2 can only send them. The send and recv methods send and receive messages, respectively. For example, in full duplex mode you can call conn1.send() to send a message and conn2.recv() to receive it. If there is no message to receive, recv blocks until one arrives. If there is nothing left to receive and the other end has been closed, recv raises EOFError.
import multiprocessing
import os, time, random


def proc_send(pipe, urls):
    for url in urls:
        print('Process(%s) send:%s' % (os.getpid(), url))
        pipe.send(url)
        time.sleep(random.random())
    # Send None as an end marker so the receiver knows when to stop
    pipe.send(None)


def proc_recv(pipe):
    while True:
        url = pipe.recv()
        if url is None:
            # End marker received; without it this loop (and p2.join()) would never finish
            break
        print('Process(%s) recv:%s' % (os.getpid(), url))
        time.sleep(random.random())


if __name__ == '__main__':
    # Call the Pipe() method, which returns the two connection ends
    pipe = multiprocessing.Pipe()
    p1 = multiprocessing.Process(target=proc_send, args=(pipe[0], ['url_' + str(i) for i in range(10)]))
    p2 = multiprocessing.Process(target=proc_recv, args=(pipe[1],))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
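The duplex=False and EOFError behaviour described above can be tried out without starting any child process. The sketch below keeps both ends in a single process purely to stay short (in real use each end would be handed to a different process), and the variable names are only illustrative:

```python
from multiprocessing import Pipe

# duplex=False gives a one-way pipe: the first end receives, the second sends.
receiver, sender = Pipe(duplex=False)

sender.send('url_1')
first = receiver.recv()          # the message sent on the line above

# poll() reports whether data is waiting, without blocking the way recv() does.
has_data = receiver.poll()

# Sending on the receive-only end is rejected with OSError.
try:
    receiver.send('url_2')
    send_rejected = False
except OSError:
    send_rejected = True

# Once the sending end is closed and nothing is left to read,
# recv() raises EOFError instead of blocking forever.
sender.close()
try:
    receiver.recv()
    got_eof = False
except EOFError:
    got_eof = True

print(first, has_data, send_rejected, got_eof)
```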
Multithreading
The standard library of Python provides two modules: thread (renamed _thread in Python 3) and threading. thread is a low-level module, and threading is a high-level module that wraps it. In most cases, we only need the high-level threading module.
1. Using the threading module to create threads
The threading module generally creates threads in two ways. The first way is to pass a function in when creating a Thread instance, and then call the start method to start execution. The code is as follows:
import random
import time, threading


# Code executed by a new thread
def thread_run(urls):
    print('Current %s is running ...' % threading.current_thread().name)
    for url in urls:
        print('%s --->>> %s' % (threading.current_thread().name, url))
        time.sleep(random.random())
    print('%s ended.' % threading.current_thread().name)


print('%s is running...' % threading.current_thread().name)
t1 = threading.Thread(target=thread_run, name='Thread_1', args=(['url_1', 'url_2', 'url_3'],))
t2 = threading.Thread(target=thread_run, name='Thread_2', args=(['url_4', 'url_5', 'url_6'],))
t1.start()
t2.start()
t1.join()
t2.join()
print('%s ended.' % threading.current_thread().name)
The second way is to create a thread class by inheriting directly from threading.Thread, and then override the __init__ and run methods.
The code is as follows:
import random
import threading
import time


class myThread(threading.Thread):
    def __init__(self, name, urls):
        threading.Thread.__init__(self, name=name)
        self.urls = urls

    def run(self):
        print('Current %s is running ...' % threading.current_thread().name)
        for url in self.urls:
            print('%s --->>> %s' % (threading.current_thread().name, url))
            time.sleep(random.random())
        print('%s ended ...' % threading.current_thread().name)


print('%s is running...' % threading.current_thread().name)
t1 = myThread(name='Thread_1', urls=['url_1', 'url_2', 'url_3'])
t2 = myThread(name='Thread_2', urls=['url_4', 'url_5', 'url_6'])
t1.start()
t2.start()
t1.join()
t2.join()
print('%s ended.' % threading.current_thread().name)
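One thing the subclass style makes convenient is keeping a result on the thread object, since run() has no way to return a value to the caller. A small sketch, where the LengthThread class and its total attribute are invented for illustration:

```python
import threading


class LengthThread(threading.Thread):
    def __init__(self, name, urls):
        super().__init__(name=name)
        self.urls = urls
        self.total = None   # filled in by run()

    def run(self):
        # run() cannot return a value to the caller, so keep it on the instance.
        self.total = sum(len(url) for url in self.urls)


t = LengthThread(name='Thread_1', urls=['url_1', 'url_2', 'url_3'])
t.start()
t.join()   # after join() returns, run() has finished and total is set
print(t.total)
```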
2. Thread synchronization
To keep shared data correct, multiple threads need to be synchronized, which is done with the Lock and RLock objects of the threading module.
Both objects have acquire and release methods. Code that operates on data which only one thread may touch at a time can be placed between the acquire and release calls.
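Instead of calling acquire and release by hand, a lock can also be used in a with statement, which releases it even if the protected code raises. A minimal sketch, with the counter, counter_lock and add_many names chosen only for this example:

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times):
    global counter
    for _ in range(times):
        # "with" acquires the lock on entry and releases it on exit.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 40000: every increment ran while holding the lock
```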
For a Lock object, if a thread calls acquire twice in a row without a release in between, the second acquire blocks the thread. The lock is then never released, and the thread deadlocks. An RLock object allows the same thread to acquire it multiple times, because it keeps an internal counter of how many times the owning thread has acquired it. Each acquire must be matched by a release, and only after all of them have been released can other threads acquire the RLock. Thread synchronization demo code is as follows:
import threading

mylock = threading.RLock()
num = 0


class myThread(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self, name=name)

    def run(self):
        global num
        while True:
            mylock.acquire()
            print('%s locked,Number:%d' % (threading.current_thread().name, num))
            if num >= 4:
                mylock.release()
                print('%s released,Number:%d' % (threading.current_thread().name, num))
                break
            num += 1
            print('%s released,Number:%d' % (threading.current_thread().name, num))
            mylock.release()


if __name__ == '__main__':
    thread1 = myThread('Thread1')
    thread2 = myThread('Thread2')
    thread1.start()
    thread2.start()
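The difference between Lock and RLock on a repeated acquire can be checked safely from a single thread by passing blocking=False, so that a failed acquire returns False instead of hanging. This is just a small sketch of that idea:

```python
import threading

lock = threading.Lock()
rlock = threading.RLock()

lock.acquire()
# blocking=False makes acquire() return False instead of deadlocking here.
second_lock = lock.acquire(blocking=False)
lock.release()

rlock.acquire()
# Same thread again: allowed, the internal counter goes up to 2.
second_rlock = rlock.acquire(blocking=False)
# One release per acquire is needed before other threads can take the RLock.
rlock.release()
rlock.release()

print(second_lock, second_rlock)   # False True
```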