Question: I once ran into the following requirement: there are multiple computing nodes, each with multiple GPUs, and all of those GPUs need to work on the same huge task. How do you build such a cluster in a Python environment?
The first idea is to use MPI for message passing between parallel nodes: decompose the huge computing task into many small tasks, so that each computing node in the cluster processes part of the work.
For setting up an MPI cluster, see this blog post: "Setting up an MPI cluster environment" (secyb's blog on CSDN, covering MPI environment configuration).
The MPI library for Python is documented here: Introduction — MPI for Python 3.1.3 documentation.
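Before bringing GPUs into the picture, it is worth confirming that mpi4py itself works across the cluster. Below is a minimal sketch (hello_mpi.py is a file name of our own choosing) using the same scatter/reduce pattern as the full program later:

# hello_mpi.py -- run with: mpiexec -hosts node1,node2 -n 4 python3 hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The root builds one chunk of work per process; the others pass None
data = [i * 10 for i in range(size)] if rank == 0 else None
# Every process receives exactly one chunk
chunk = comm.scatter(data, root=0)
# Combine the per-process results back on the root
total = comm.reduce(chunk * 2, root=0, op=MPI.SUM)
if rank == 0:
    print('sum of processed chunks:', total)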
Once the MPI cluster is set up, you can run MPI programs in the cluster environment. Here we use two nodes (node1 and node2), each with two GPUs. The following command runs a task on the cluster, assigning two processes to each node:
mpiexec -hosts node1,node2 -n 4 python3 demo.py
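If you would rather not list the hosts on the command line, mpiexec also accepts a host file. A sketch of the MPICH-style format is below; Open MPI uses a slightly different "slots=" syntax, so check your own distribution's mpiexec manual:

# machinefile: two processes on each node
node1:2
node2:2

mpiexec -machinefile machinefile -n 4 python3 demo.py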
Below is a simple data-distribution and GPU-computing program we wrote using MPI. The master node knows the number and dimensions of the matrices to be processed, and divides them evenly among the computing nodes. We assume the GPU memory of a node cannot hold all of its matrices at once, so the matrices waiting to be computed are kept in a memory pool, simulated here with a list. After a GPU finishes one matrix, the next matrix is taken from the list and moved onto the GPU. In the multi-GPU function each matrix is multiplied by itself; we also test a multi-CPU function, which doubles every element of each matrix it receives. Because each MPI process here drives all of the GPUs on its node through a process pool, this particular program is launched with one MPI process per node (mpiexec -hosts node1,node2 -n 2 python3 demo.py), and the code splits the data into one slice per MPI process.
import numpy as np
import time
import random
from mpi4py import MPI
import subprocess
from multiprocessing import Pool
import socket
import cupy as cp  # note: cupy, not copy -- cp.cuda/cp.matmul below are CuPy calls


class multi_GPU():
    def __init__(self, task_list):
        # Tasks assigned to the current node
        self.task_list = task_list[:]

    def available_GPU(self):
        '''
        Return the index of the GPU with the largest fraction of free memory.
        '''
        total_GPU_str = subprocess.getoutput(
            "nvidia-smi -q -d Memory | grep -A4 GPU | grep Total | grep -o '[0-9]\\+'")
        total_GPU = np.array([int(x) for x in total_GPU_str.split('\n')])
        avail_GPU_str = subprocess.getoutput(
            "nvidia-smi -q -d Memory | grep -A4 GPU | grep Free | grep -o '[0-9]\\+'")
        avail_GPU = np.array([int(x) for x in avail_GPU_str.split('\n')])
        return np.argmax(avail_GPU / total_GPU)

    def parallel_run(self):
        '''
        Split the node's tasks across its GPUs; one worker process per GPU.
        '''
        nDevice = int(subprocess.getoutput('nvidia-smi -L | grep GPU | wc -l'))
        pool = Pool(nDevice)
        print('there are %d tasks on %s' % (len(self.task_list), socket.gethostname()))
        print('the nDevice is {}'.format(nDevice))
        num = 0
        # Assign tasks until the local pool is empty
        while len(self.task_list) != 0:
            # Pick the GPU that is currently most idle
            time.sleep(random.randint(1, 4))
            device_i = self.available_GPU()
            task = self.task_list.pop()
            num += 1
            # apply_async dispatches run() to a worker process, i.e. to a GPU
            pool.apply_async(func=self.run, args=(device_i, task, num))
        pool.close()
        pool.join()
        r = []
        for i in range(1, num + 1):
            # Collect the result each GPU worker saved
            # (the device/ directory must exist beforehand)
            r.append(cp.asnumpy(cp.load('device/' + str(i) + '.npy')))
        return r

    def run(self, device_id, task, num):
        # Bind this worker to the chosen GPU
        with cp.cuda.Device(device_id):
            task = cp.array(task)
            print(device_id)
            # The arithmetic task: multiply the matrix by itself
            mat = cp.matmul(task, task)
            # Save this task's result; it is collected after pool.join()
            cp.save('device/' + str(num), mat)
        time.sleep(3)


def worker(data):
    '''
    CPU working function: double every element.
    '''
    time.sleep(random.randint(0, 3))
    return data * 2


class data_Loader():
    def __init__(self, number, dim):
        '''
        :param number: number of matrices to be processed
        :param dim: dimension of each (square) matrix
        '''
        self.number = number
        self.dim = dim

    def data_pool(self):
        '''
        Simulate a memory pool: a list of the numpy arrays to process.
        '''
        task = []
        for i in range(self.number):
            # Generate `number` matrices of shape [dim, dim]
            task.append(np.random.randn(self.dim, self.dim))
        print(np.shape(task))
        return task

    def data_cpu_reader(self):
        return self.data_pool()

    def data_gpu_reader(self, node_number=2):
        # Convert the pool to one array and split it across the
        # node_number machines, e.g. (4, 2, 2) -> (2, 2, 2, 2)
        task = np.array(self.data_pool())
        task = task.reshape((node_number, -1, self.dim, self.dim))
        return task.tolist()


class data_manager():
    def __init__(self, Loader):
        super(data_manager, self).__init__()
        self.Loader = Loader

    def to_gpu(self):
        '''
        Process the tasks with the GPUs of multiple hosts.
        '''
        comm = MPI.COMM_WORLD
        size = comm.Get_size()
        rank = comm.Get_rank()
        name = MPI.Get_processor_name()
        print('Hello! I am process %d of %d on %s' % (rank, size, name))
        if rank == 0:
            time1 = time.time()
            # Split the data into one slice per MPI process
            data = self.Loader.data_gpu_reader(node_number=size)
            time2 = time.time()
            print('generate data:', time2 - time1)
        else:
            data = None
        # Distribute one slice of the pool to every process
        local_data = comm.scatter(data, root=0)
        print('rank %d get:' % rank)
        print(local_data)
        # Hand this node's slice to the multi_GPU class
        multi = multi_GPU(list(local_data))
        num = multi.parallel_run()
        print('processed:', num)
        # Gather the result lists on process 0 (which also took part in the work)
        all_sum = comm.reduce(num, root=0, op=MPI.SUM)
        if rank == 0:
            print('reduced:', all_sum)
        return all_sum

    def to_cpu(self):
        '''
        Process the tasks with the CPUs of multiple hosts
        (assumes the number of matrices equals the number of processes).
        '''
        comm = MPI.COMM_WORLD
        size = comm.Get_size()
        rank = comm.Get_rank()
        name = MPI.Get_processor_name()
        print('Hello! I am process %d of %d on %s' % (rank, size, name))
        if rank == 0:
            data = self.Loader.data_cpu_reader()
        else:
            data = None
        local_data = comm.scatter(data, root=0)
        print('rank %d get:' % rank)
        local_data = worker(local_data)
        print('processed:', local_data)
        # Sum the per-process results on process 0
        all_sum = comm.reduce(local_data, root=0, op=MPI.SUM)
        if rank == 0:
            print('reduced:', all_sum)
        return all_sum


if __name__ == '__main__':
    # Number of matrices to process
    number = 6
    # Dimension of each matrix
    dim = 2
    instance1 = data_Loader(number, dim)
    instance2 = data_manager(instance1)
    result = instance2.to_gpu()
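Since every task simply multiplies a matrix by itself, the list collected on the root can be sanity-checked against numpy. A minimal sketch, where inputs and results are assumed to be the flattened input matrices and the reduced result list in matching order (note that the pop()-based scheduling above reverses each node's local order, so align the two lists first):

import numpy as np

def check_results(inputs, results):
    # Each GPU result should equal numpy's product of the input with itself
    assert len(inputs) == len(results)
    for a, b in zip(inputs, results):
        assert np.allclose(np.matmul(a, a), b), 'GPU result mismatch'
    print('all %d results verified' % len(results))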
Now suppose host memory is also limited, so the data_pool cannot hold all the data to be computed either. In that case we read part of the data from the storage device into the memory pool, and once some of the pool's data has been transferred to the GPU, we read new data into the freed slots. Asynchronous execution lets us overlap this data transfer with computation.
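Before the full two-process, two-coroutine version below, the overlap idea itself can be sketched with CuPy alone: a second, non-blocking CUDA stream uploads the next matrix while the default stream computes on the current one. This is a minimal single-GPU illustration of our own (for the copy truly to overlap the kernel, the host buffers would additionally need to be pinned, which this sketch omits):

import numpy as np
import cupy as cp

def overlapped_process(chunks):
    '''Multiply each matrix by itself, prefetching the next one on a side stream.'''
    copy_stream = cp.cuda.Stream(non_blocking=True)
    results = []
    with copy_stream:
        current = cp.asarray(chunks[0])  # upload the first chunk
    copy_stream.synchronize()
    for i in range(len(chunks)):
        nxt = None
        if i + 1 < len(chunks):
            with copy_stream:
                nxt = cp.asarray(chunks[i + 1])  # prefetch while we compute
        # Compute on the default stream while the copy may still be in flight
        results.append(cp.asnumpy(current @ current))
        if nxt is not None:
            copy_stream.synchronize()  # make sure the prefetch has landed
            current = nxt
    return results

# e.g. overlapped_process(list(np.random.rand(5, 4, 4)))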
Here we use two processes and two coroutines per node to keep the data_pool updated dynamically. Each process manages the computing task of one GPU on the current node. A process first transfers data from the data_pool to GPU memory; while the GPU computes, the event loop switches to the loading coroutine, which reads new data (if any remains) from the standby_pool and overwrites the data_pool slot whose contents have already been moved to GPU memory. Once the overwrite completes, control switches back to the GPU coroutine, which waits for the computation to finish and saves the result. This repeats until all tasks are done.
import random
import numpy as np
from multiprocessing import Pool, Manager
import time
import os
import asyncio
import subprocess
import cupy as cp


async def load_2_datapool(standby_pool, data_pool, i, q, n):
    '''
    Simulate loading new data into the data_pool.
    :param standby_pool: spare pool holding the not-yet-loaded data
    :param data_pool: shared pool being refilled
    :param i: which slot of data_pool to overwrite
    :param q: queue of slot tags that are ready to be processed
    :param n: queue of tags the results should be saved under
    '''
    pid = os.getpid()
    if len(standby_pool) != 0:
        print(f'process {pid} reads new data for slot {i}')
        try:
            # Tag the new data so its result file does not collide with the
            # original slots 0..len(data_pool)-1 (the original used
            # len(standby_pool) + 3, which can collide)
            n.put(len(data_pool) + len(standby_pool) - 1)
            new_data = standby_pool.pop()
            time.sleep(1)  # simulate the read latency
            print(f'process {pid} transfers new data into slot {i}')
            time.sleep(1)  # simulate the transfer latency
            data_pool[i] = new_data
            # The GPU already holds slot i's old data and the new data is now
            # in data_pool[i], so mark slot i as ready again
            q.put(i)
            await asyncio.sleep(1)
        except IndexError:
            print(f'process {pid} failed to read from the spare pool')
    else:
        print('the spare data_pool is used up!')
    return


async def GPU_compute(data, data_tag, extend_tag):
    '''
    Simulate computing on the GPU.
    :param data: the matrix to process
    :param data_tag: which slot the current coroutine handles
    :param extend_tag: save tag for refilled data ('Nan' for original data)
    '''
    pid = os.getpid()
    print(f"process {pid}'s data: {data}")
    # Pick the currently idle GPU
    device_i = available_GPU()
    print(f'the idle GPU is {device_i}')
    with cp.cuda.Device(device_i):
        print(f'process {pid} transfers data to GPU memory......')
        task = cp.array(data)
        print(f'the GPU {device_i} is computing...')
        # The kernel launch is asynchronous; yield here so the loading
        # coroutine can refill the data_pool while the GPU computes
        # (the original wrapped cp.dot in asyncio.create_task, but cp.dot
        # returns an array, not an awaitable)
        result = cp.dot(task, task)
        await asyncio.sleep(1)
        print('await result:', result)
        print(f'process {pid} stores the GPU result.......')
        num = data_tag if extend_tag == 'Nan' else extend_tag
        cp.save('device/' + str(num), result)
    time.sleep(1)
    return


def one_numpy_dif_process(standby_pool, data_pool, q, n):
    '''
    Run in each worker process: repeatedly take a ready slot from q and
    process it, pairing the GPU coroutine with the loading coroutine.
    '''
    # Create this process's event loop once
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    # q always holds the tags of the data waiting to be processed
    while not q.empty():
        data_tag = q.get()
        extend_tag = n.get()
        print('process', os.getpid(), 'obtained permission to process slot:', data_tag)
        print('process', os.getpid(), 'starts processing data')
        coroutine_1 = GPU_compute(data_pool[data_tag], data_tag, extend_tag)
        coroutine_2 = load_2_datapool(standby_pool, data_pool, data_tag, q, n)
        # asyncio.wait takes a task list; execution order is related to the
        # order of the tasks in the list
        tasks = [
            asyncio.ensure_future(coroutine_1),
            asyncio.ensure_future(coroutine_2),
        ]
        dones, pendings = loop.run_until_complete(asyncio.wait(tasks))


def available_GPU():
    '''
    Return the index of the GPU with the largest fraction of free memory.
    '''
    time.sleep(random.randint(0, 1))
    total_GPU_str = subprocess.getoutput(
        "nvidia-smi -q -d Memory | grep -A4 GPU | grep Total | grep -o '[0-9]\\+'")
    total_GPU = np.array([int(x) for x in total_GPU_str.split('\n')])
    avail_GPU_str = subprocess.getoutput(
        "nvidia-smi -q -d Memory | grep -A4 GPU | grep Free | grep -o '[0-9]\\+'")
    avail_GPU = np.array([int(x) for x in avail_GPU_str.split('\n')])
    return np.argmax(avail_GPU / total_GPU)


if __name__ == '__main__':
    '''
    Use two processes and two coroutines; each node handles 10 numpy arrays.
    1. Fill the data_pool.
    2. Once the data_pool is full, the computing tasks begin.
    3. Each process manages one GPU: it transfers data to GPU memory, then
       switches coroutines while the GPU computes.
    4. The loading coroutine reads in new data (if any), overwriting the slot
       whose data has already been transferred to the GPU; afterwards control
       switches back to the GPU task to obtain the result.
    5. The GPU result is dumped to the device/ directory.
    '''
    # Standby pool: once data in data_pool has been processed, new data is
    # read from standby_pool into the freed data_pool slot
    tmp_standby = np.random.rand(5, 4, 4)
    standby_pool = Manager().list()
    standby_pool.extend(tmp_standby)
    tmp = np.random.rand(5, 4, 4)
    data_pool = Manager().list()
    data_pool.extend(tmp)
    # Queues keep the two processes' tasks mutually exclusive:
    # q holds all pending slot tags; n holds the save tag of each slot's data
    q = Manager().Queue()
    n = Manager().Queue()
    for i in range(len(data_pool)):
        q.put(i)
        n.put('Nan')
    print(data_pool)
    p = Pool(2)
    p.apply_async(func=one_numpy_dif_process, args=(standby_pool, data_pool, q, n))
    p.apply_async(func=one_numpy_dif_process, args=(standby_pool, data_pool, q, n))
    p.close()
    p.join()
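The script above exercises a single node; in the cluster setting it would be started on every node in the same way as before, e.g. mpiexec -hosts node1,node2 -n 2 python3 async_demo.py (async_demo.py being an assumed file name), so that one process runs per node and then forks its own two-process pool to drive that node's GPUs. Note also that cp.save writes the results into a device/ directory, which must already exist on every node before the run.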