Multi-machine, multi-GPU distributed computing

Keywords: Python, distributed computing

Question: I once ran into the following requirement: given several compute nodes, each with multiple GPUs, all of which must work on the same large task, how do you build such a cluster in a Python environment?

The first idea is to use MPI to pass messages between parallel nodes and decompose one huge computing task into many small tasks, so that each compute node in the cluster processes a part of it.

For information on setting up an MPI cluster, see this blog post: Setting up an MPI cluster environment (secyb's blog on CSDN).

The MPI library for Python is documented here: Introduction — MPI for Python 3.1.3 documentation.

Once the MPI cluster is set up, you can run MPI programs across it. Here we use two nodes (node1 and node2), each with two GPUs. The following command launches the task across the cluster, assigning two processes to each node. (Note that mpi4py's comm.scatter, used below, delivers exactly one item per rank, so the process count passed to -n must equal the length of the list that the root scatters; the GPU demo below splits the data into node_number = 2 chunks, so for that demo you would launch with -n 2, one process per node.)

mpiexec -hosts node1,node2 -n 4 python3 demo.py
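Before running the real job, a minimal mpi4py sketch can verify that every rank comes up on the expected host (the file name check_cluster.py is just an assumption for illustration):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()            # this process's id within the communicator
size = comm.Get_size()            # total number of MPI processes
name = MPI.Get_processor_name()   # hostname of the node running this rank
print('rank %d of %d is running on %s' % (rank, size, name))

Launching it the same way (mpiexec -hosts node1,node2 -n 4 python3 check_cluster.py) should print one line per process, two from each node.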

Below is a simple data-distribution and GPU-computation demo built on MPI. The master node knows the number and dimensions of the matrices to be processed and splits them evenly across the compute nodes. We assume each node's GPU memory cannot hold all of its matrices at once, so the matrices waiting to be processed are kept in a memory pool, simulated here with a Python list. After one matrix is processed, the next is taken from the list and moved to the GPU. In the multi-GPU function each matrix is multiplied by itself; in the multi-CPU test each matrix's elements are doubled.

import numpy as np
import time
import random
from mpi4py import MPI
import subprocess
from multiprocessing import Pool
import socket
import cupy as cp  # CuPy: NumPy-compatible arrays on the GPU

class multi_GPU():

    def __init__(self, task_list):
        # Tasks assigned to the current node
        self.task_list = task_list[:]

    def available_GPU(self):
        '''
        Return the index of the GPU with the largest fraction of free
        memory, parsed from nvidia-smi.
        '''
        total_GPU_str = subprocess.getoutput(r"nvidia-smi -q -d Memory | grep -A4 GPU | grep Total | grep -o '[0-9]\+'")
        total_GPU = np.array([int(x) for x in total_GPU_str.split('\n')])
        avail_GPU_str = subprocess.getoutput(r"nvidia-smi -q -d Memory | grep -A4 GPU | grep Free | grep -o '[0-9]\+'")
        avail_GPU = np.array([int(x) for x in avail_GPU_str.split('\n')])
        # Rank GPUs by free/total memory ratio and pick the most idle one
        return np.argmax(avail_GPU / total_GPU)

    def parallel_run(self):
        '''
        Distribute this node's tasks across its GPUs, one worker process per GPU.
        '''
        nDevice = int(subprocess.getoutput('nvidia-smi -L | grep GPU |wc -l'))
        # Each worker process manages one GPU
        pool = Pool(nDevice)
        print('there are %d tasks on %s' % (len(self.task_list), socket.gethostname()))
        print('the nDevice is {}'.format(nDevice))
        num = 0
        # Hand out tasks until the list is empty
        while self.task_list:
            # Pick the GPU that currently has the most free memory
            time.sleep(random.randint(1, 4))
            device_i = self.available_GPU()
            task = self.task_list.pop()
            num += 1
            # Each worker calls self.run to execute its task on its GPU
            pool.apply_async(func=self.run, args=(device_i, task, num))
        pool.close()
        pool.join()
        r = []
        for i in range(1, num + 1):
            # Load the result file saved by task i
            r.append(cp.asnumpy(cp.load('device/' + str(i) + '.npy')))
        return r

    def run(self, device_id, task, num):
        # Pin the computation to the chosen GPU
        with cp.cuda.Device(device_id):
            task = cp.array(task)
            print(device_id)
            # The actual arithmetic: multiply the matrix by itself
            mat = cp.matmul(task, task)
        # Each task saves its result to its own .npy file
        cp.save('device/' + str(num), mat)
        time.sleep(3)

def worker(data):
    '''
    CPU worker: return data * 2 after a random delay.
    '''
    time.sleep(random.randint(0, 3))
    return data * 2

class data_Loader():
    def __init__(self, number, dim):
        '''
        :param number: number of matrices to process
        :param dim: dimension of each (square) matrix
        '''
        self.number = number
        self.dim = dim

    def data_pool(self):
        '''
        Build the in-memory pool of numpy arrays to be processed.
        '''
        task = []
        for i in range(self.number):
            # Generate `number` random matrices of shape [dim, dim]
            task.append(np.random.randn(self.dim, self.dim))
        print(np.shape(task))
        return task

    def data_cpu_reader(self):
        '''
        Return the task list as-is for the CPU path.
        '''
        task = self.data_pool()
        return task

    def data_gpu_reader(self, node_number=2):
        '''
        Return the tasks split into one chunk per compute node.
        '''
        # Convert the data to array form
        task = np.array(self.data_pool())
        # Split the data into node_number chunks, one per node:
        # with number=6, dim=2: (6,2,2) -> (2,3,2,2)
        task = task.reshape((node_number, -1, self.dim, self.dim))
        task = task.tolist()
        return task


class data_manager():

    def __init__(self, Loader):
        super(data_manager, self).__init__()
        self.Loader = Loader

    def to_gpu(self):
        '''
        Process tasks on the GPUs of multiple hosts.
        '''
        comm = MPI.COMM_WORLD
        size = comm.Get_size()
        rank = comm.Get_rank()
        name = MPI.Get_processor_name()
        print('Hello! I am process %d of %d on %s' % (rank, size, name))
        if rank == 0:
            time1 = time.time()
            #Process data format
            data = self.Loader.data_gpu_reader()
            time2 = time.time()
            print('generate data:', time2 - time1)
            # print('data:',data)
        else:
            data = None
        # Scatter one chunk of data to each rank
        local_data = comm.scatter(data, root=0)
        print('rank %d got:' % rank)
        print(local_data)
        local_data = list(local_data)
        # Hand the chunk received by this node to the multi_GPU class
        multi = multi_GPU(local_data)
        results = multi.parallel_run()
        print('processed:', results)
        # Gather at rank 0 (the root also takes part in the computation);
        # MPI.SUM applied to Python lists concatenates them
        all_sum = comm.reduce(results, root=0, op=MPI.SUM)
        if rank == 0:
            print('reduced:', all_sum)

        return all_sum

    def to_cpu(self):
        '''
        Process tasks on the CPUs of multiple hosts.
        '''
        comm = MPI.COMM_WORLD
        size = comm.Get_size()
        rank = comm.Get_rank()
        name = MPI.Get_processor_name()

        print('Hello! I am process %d of %d on %s' % (rank, size, name))
        if rank == 0:
            data = self.Loader.data_cpu_reader()
        else:
            data = None
        # Scatter: requires len(data) to equal the number of MPI processes
        local_data = comm.scatter(data, root=0)
        print('rank %d got:' % rank)
        local_data = worker(local_data)
        print('processed:', local_data)
        # Rank 0 is the root and also takes part in the computation
        all_sum = comm.reduce(local_data, root=0, op=MPI.SUM)
        if rank == 0:
            print('reduced:', all_sum)
        return all_sum


if __name__ == '__main__':

    # Number of matrices to process
    number = 6
    # Dimension of each matrix
    dim = 2
    instance1 = data_Loader(number, dim)
    instance2 = data_manager(instance1)
    result = instance2.to_gpu()
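Assuming the script above is saved as demo.py, the GPU demo scatters node_number = 2 chunks, so it is launched with one MPI process per node; also make sure the device/ directory used for the intermediate .npy files exists on every node:

mpiexec -hosts node1,node2 -n 2 python3 demo.py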

Now suppose host memory is also limited, so data_pool cannot hold all of the data to be processed at once. In that case we need to read part of the data from the storage device into the memory pool; once a batch from the pool has been transferred to the GPU, new data can be read into the pool. Here we can use asynchronous execution to overlap data transfer with computation.

We use two processes, each running two coroutines, to update data_pool dynamically. Each process manages the computing task of one GPU on the current node. First, data is transferred from data_pool to GPU memory. While the GPU computes, control switches to the other coroutine, which reads new data (if any remains in standby_pool) and overwrites the data_pool slot whose contents have already been transferred to the GPU. After the overwrite completes, control switches back to the GPU task, waits for the computation to finish, and saves the result; this repeats until all tasks are done.
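Before the full per-node script, here is a minimal sketch of the compute/load overlap pattern in plain asyncio (the names and sleep durations are illustrative stand-ins for real transfer and compute times):

import asyncio

async def compute(slot, data):
    # Stand-in for the GPU task: awaiting lets the loader coroutine run
    print('computing on slot %d' % slot)
    await asyncio.sleep(1.0)        # simulated compute time
    return data * 2

async def refill(pool, slot, standby):
    # Overwrite a slot whose contents are already on the GPU with fresh data
    if standby:
        print('refilling slot %d' % slot)
        await asyncio.sleep(0.5)    # simulated read/transfer time
        pool[slot] = standby.pop()

async def main():
    pool = [1, 2]
    standby = [3, 4]
    for slot in range(len(pool)):
        # Run both coroutines concurrently: the refill overlaps the compute
        result, _ = await asyncio.gather(compute(slot, pool[slot]),
                                         refill(pool, slot, standby))
        print('result:', result)

asyncio.run(main())

The full script below applies the same idea, with two worker processes (one per GPU) each running a compute coroutine and a load coroutine.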

import asyncio
import os
import random
import subprocess
import time

import cupy as cp
import numpy as np
from multiprocessing import Pool, Manager

async def load_2_datapool(standby_pool, data_pool, i, q, n):
    '''
    Simulate feeding new data into data_pool.
    :param standby_pool: spare pool holding the not-yet-loaded data
    :param data_pool: the shared in-memory pool
    :param i: index of the data_pool slot to overwrite
    :param q: queue of slot indices ready to be processed
    :param n: queue of tags used to name the saved results
    :return:
    '''
    pid = os.getpid()
    if len(standby_pool) != 0:
        print(f'process {pid} reads new data: {i}')
        try:
            # Give the new data a fresh tag above the initial
            # 0..len(data_pool)-1 range so its saved result does not
            # collide with the results of the initial data
            n.put(len(data_pool) + len(standby_pool) - 1)
            new_data = standby_pool.pop()
            time.sleep(1)
            # Transfer the new data into the freed slot
            print(f'process {pid} transfers new data: {i}')
            time.sleep(1)
            data_pool[i] = new_data
            # The GPU already holds the old contents of slot i and the new
            # data is now in data_pool[i], so re-enqueue tag i for processing
            q.put(i)
            await asyncio.sleep(1)
        except IndexError:
            print(f'process {pid} failed to read from the standby pool')
    else:
        print('the standby data_pool is exhausted!')
        return

def one_numpy_dif_process(standby_pool, data_pool, q, n):
    '''
    Let different processes handle different slots of the same shared pool
    of numpy arrays.
    :param standby_pool: spare pool holding the not-yet-loaded data
    :param data_pool: the shared in-memory pool
    :param q: queue of slot indices waiting to be processed
    :param n: queue of tags used to name the saved results
    '''
    # q always holds the indices of the slots still waiting to be processed
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    while not q.empty():
        data_tag = q.get()
        extend_tag = n.get()
        print('process', os.getpid(), 'obtained data-processing permission:', data_tag)
        print('process', os.getpid(), 'starts processing data')
        coroutine_1 = GPU_compute(data_pool[data_tag], data_tag, extend_tag)
        coroutine_2 = load_2_datapool(standby_pool, data_pool, data_tag, q, n)
        # Run the compute and the reload concurrently so that reading new
        # data overlaps the GPU computation
        tasks = [
            asyncio.ensure_future(coroutine_1),
            asyncio.ensure_future(coroutine_2),
        ]
        dones, pendings = loop.run_until_complete(asyncio.wait(tasks))


async def GPU_compute(data, data_tag, extend_tag):
    '''
    Simulate computing on the GPU.
    :param data: the matrix to process
    :param data_tag: which data_pool slot this coroutine is handling
    :param extend_tag: tag used to name the saved result ('Nan' for the
                       initial data, otherwise the tag assigned by the loader)
    :return:
    '''
    pid = os.getpid()
    print(f"{pid}'s data: {data}")
    # Pick the GPU that currently has the most free memory
    device_i = available_GPU()
    print(f'the idle GPU is {device_i}')
    with cp.cuda.Device(device_i):
        print(f'process {pid} transfers data to GPU memory......')
        task = cp.array(data)
        print(f'the GPU {device_i} is computing...')
        # CuPy kernel launches are asynchronous with respect to the host, so
        # after launching we can yield to the loader coroutine and only
        # synchronize when the result is needed
        result = cp.dot(task, task)
        await asyncio.sleep(0)                   # switch to the loader coroutine
        cp.cuda.Device(device_i).synchronize()   # wait for the GPU to finish
    print('result:', result)
    print(f'process {pid} saves the GPU result.......')
    if extend_tag == 'Nan':
        num = data_tag
    else:
        num = extend_tag
    cp.save('device/' + str(num), result)
    time.sleep(1)
    return

def available_GPU():
    '''
    Return the index of the GPU with the largest fraction of free memory,
    parsed from nvidia-smi.
    '''
    time.sleep(random.randint(0, 1))
    total_GPU_str = subprocess.getoutput(r"nvidia-smi -q -d Memory | grep -A4 GPU | grep Total | grep -o '[0-9]\+'")
    total_GPU = np.array([int(x) for x in total_GPU_str.split('\n')])
    avail_GPU_str = subprocess.getoutput(r"nvidia-smi -q -d Memory | grep -A4 GPU | grep Free | grep -o '[0-9]\+'")
    avail_GPU = np.array([int(x) for x in avail_GPU_str.split('\n')])
    # Rank GPUs by free/total memory ratio and pick the most idle one
    return np.argmax(avail_GPU / total_GPU)

if __name__ == '__main__':
    '''
    Use two processes, each running two coroutines.
    Each node handles 10 numpy arrays in total:
    1. two processes fill data_pool with the first batch of data
    2. once data_pool is full, computation starts
    3. each process manages the computation of one GPU: it transfers data to
       GPU memory, then yields to the other coroutine while the GPU computes
    4. that coroutine reads in new data (if any) and overwrites the slot
       whose contents are already on the GPU; afterwards control returns to
       the GPU task to collect the result
    5. the GPU results are saved to disk
    '''
    # Standby pool: once the data currently in data_pool has been processed,
    # new data is read from standby_pool into data_pool
    tmp_standby = np.random.rand(5,4,4)
    standby_pool = Manager().list()
    standby_pool.extend(tmp_standby)
    tmp = np.random.rand(5,4,4)
    data_pool = Manager().list()
    data_pool.extend(tmp)
    # Queues keep the processes' task handling mutually exclusive
    q = Manager().Queue()
    n = Manager().Queue()
    # Put every initial task tag on queue q; whenever a data_pool slot is
    # overwritten, the tag for the replacement data is put on queue n
    # ('Nan' marks the initial, not-yet-replaced data)
    for i in range(len(data_pool)):
        q.put(i)
        n.put('Nan')
    print(data_pool)
    # Make sure the output directory for the saved .npy results exists
    os.makedirs('device', exist_ok=True)
    p = Pool(2)
    r1 = p.apply_async(func=one_numpy_dif_process, args=(standby_pool, data_pool, q, n))
    r2 = p.apply_async(func=one_numpy_dif_process, args=(standby_pool, data_pool, q, n))
    p.close()
    p.join()
    # Surface any exceptions raised inside the worker processes
    r1.get()
    r2.get()
