No.7_4 OpenCL Synchronization - Work Item Synchronization

Keywords: Programming shell

Summary

In OpenCL, work items execute in parallel. When one work item writes data to a memory location and other work items read from that address, how do we ensure that a later reader sees the value the earlier writer stored? This is where synchronization between work items comes in.

Work item synchronization

OpenCL does not define synchronization of work items across work-groups, and a synchronization point (barrier) cannot apply to only part of the work items in a work-group: it always applies to all work items in the work-group at once. If one work item writes to a memory location and another reads from it, the reader will not necessarily see the writer's value unless the two synchronize. A memory-order fence ensures that a work item's preceding writes become visible to all other work items.

When a work item in a work-group reaches a barrier while executing kernel code, it stops until every work item in the work-group has reached the barrier; only then do all work items continue together past it. The synchronization functions between work items in OpenCL are described below.

void barrier(cl_mem_fence_flags flags)

The parameter flags specifies the memory address space, which can be a combination of the following values:

  • CLK_LOCAL_MEM_FENCE: The barrier function will either flush variables stored in local memory or perform a memory fence to ensure correct ordering of local memory accesses.
  • CLK_GLOBAL_MEM_FENCE: The barrier function performs a memory fence to ensure correct ordering of accesses to global memory. This is useful when work items write to buffer or image objects and subsequently need to read the updated data.

The barrier function includes a memory fence (covering both reads and writes) that ensures a correct read/write order for the global or local memory area. Two points deserve attention:

  • If the barrier is inside a conditional statement and any work item enters the conditional and executes the barrier, then all work items must enter the conditional and execute it.
  • If the barrier is inside a loop, all work items must execute it on every iteration of the loop (before any code after the barrier continues to execute).
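A sketch of the first rule (the condition names below are illustrative, not from the sample program):

```c
// UNSAFE: work items with local id >= 32 never reach the barrier,
// so the behavior is undefined (typically the kernel hangs)
if (get_local_id(0) < 32) {
    barrier(CLK_LOCAL_MEM_FENCE);
}

// SAFE: the condition evaluates identically for every work item in the
// work-group, so either all of them execute the barrier or none do
if (uniform_flag) {   // hypothetical value, same for the whole work-group
    barrier(CLK_LOCAL_MEM_FENCE);
}
```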

Notes

For the examples described in this article, the following points about OpenCL programming deserve attention.

Variable-length arrays

OpenCL C does not support C99-style variable-length arrays. If an array's length is not known at compile time, the device-side compiler reports an error. For example:

__kernel void kernel_dot(__global int *dst, __global int *src1,  __global int *src2)
{
    int index = get_global_id(0);
    // size is obtained at run time via a built-in function call,
    // so using it as an array length is a compile error in OpenCL C
    int size = get_global_size(0);   // ①
    //__local int buffer[WORKGROUP_SIZE];
    __local int buffer[size];

    // each work item fills its own slot of the local buffer
    buffer[index] = src1[index] * src2[index];

    // wait until every work item in the work-group has written its slot
    barrier(CLK_LOCAL_MEM_FENCE);

    // only the first work item accumulates the result
    if (index == 0) {
        int sum = 0;
        //for (int i = 0; i < WORKGROUP_SIZE; i++) {
        for (int i = 0; i < size; i++) {
            sum += buffer[i];
        }
        dst[0] = sum;
    }
}

In the kernel code above, the value of size at position ① is obtained by calling a built-in function, so it is unknown until the function returns at run time. When the device compiles this code, the following error occurs:

1|shell@HWFRD:/data/local/tmp/sync/item $ ./opencl_sync_item
build log:
<source>:7:20: error: variable length arrays are not supported in OpenCL
        __local int buffer[size];
                          ^

error: Compiler frontend failed (error code 59)
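A common workaround when the local buffer size is only known at run time is to declare the buffer as a __local pointer parameter and let the host supply its size through clSetKernelArg (passing NULL with a nonzero size allocates local memory). The sketch below adapts the sample kernel this way; it is one possible restructuring, not code from the article:

```c
// device side: the local buffer becomes a kernel parameter
__kernel void kernel_dot(__global int *dst, __global int *src1,
                         __global int *src2, __local int *buffer)
{
    int index = get_global_id(0);
    int size  = get_global_size(0);

    buffer[index] = src1[index] * src2[index];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (index == 0) {
        int sum = 0;
        for (int i = 0; i < size; i++)
            sum += buffer[i];
        dst[0] = sum;
    }
}
```

On the host, the buffer is sized when the argument is set, e.g. clSetKernelArg(kernel, 3, size * sizeof(cl_int), NULL); the NULL pointer tells the runtime to allocate local memory of the given size.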

Sharing across work-groups

In the following kernel code, position ① uses __local to allocate memory in the local memory area; such data can only be shared within the same work-group. Suppose the work-group size is set to 256 (obtained via CL_DEVICE_MAX_WORK_GROUP_SIZE) and the global work size is 512 (passed as a parameter when clEnqueueNDRangeKernel is called). This corresponds to two work-groups, and buffer cannot be shared across them. The data returned after the kernel executes therefore contains only the results of the first 256 work items.

__kernel void kernel_dot(__global int *dst, __global int *src1,  __global int *src2)
{
    int index = get_global_id(0);
    __local int buffer[WORKGROUP_SIZE];  // ①

    // each work item fills its own slot of the local buffer
    buffer[index] = src1[index] * src2[index];

    // wait until every work item in the work-group has written its slot
    barrier(CLK_LOCAL_MEM_FENCE);

    // only the first work item accumulates the result
    if (index == 0) {
        int sum = 0;
        for (int i = 0; i < WORKGROUP_SIZE; i++) {
            sum += buffer[i];
        }
        dst[0] = sum;
    }
}
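One common way to make this reduction correct across work-groups is to index the local buffer with get_local_id and write one partial sum per work-group, which the host then adds up. A sketch of that approach (an adaptation, not code from the article):

```c
__kernel void kernel_dot(__global int *dst, __global int *src1,
                         __global int *src2)
{
    int gid   = get_global_id(0);
    int lid   = get_local_id(0);    // index within this work-group
    int group = get_group_id(0);
    __local int buffer[WORKGROUP_SIZE];

    buffer[lid] = src1[gid] * src2[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0) {
        int sum = 0;
        for (int i = 0; i < WORKGROUP_SIZE; i++)
            sum += buffer[i];
        dst[group] = sum;           // one partial result per work-group
    }
}
```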

Variable declaration

If __local int buffer[WORKGROUP_SIZE] above is redefined as __global int buffer[WORKGROUP_SIZE], the kernel tries to allocate a variable in the global memory area, which is not allowed: variables cannot be declared in the global address space. Memory in the global address space corresponds to memory objects, including buffer objects and image objects. The build fails as follows:

build log:
<source>:5:15: error: variable cannot be declared in global address space
        __global int buffer[WORKGROUP_SIZE];
                     ^

error: Compiler frontend failed (error code 59)

Sample program

In the sample program OpenCLSyncItem, a buffer the size of the work-group is declared in the local memory area. Each work item in the work-group (identified by index) writes one element of the buffer based on values from global memory. Afterwards, the first work item accumulates the buffer's members and the result is returned to the host. To ensure that the data the first work item reads has been updated, barrier is used to synchronize access to the local memory area. The kernel code is as follows:

__kernel void kernel_dot(__global int *dst, __global int *src1,  __global int *src2)
{
    int index = get_global_id(0);
    __local int buffer[WORKGROUP_SIZE];

    // each work item fills its own slot of the local buffer
    buffer[index] = src1[index] * src2[index];

    // wait until every work item in the work-group has written its slot
    barrier(CLK_LOCAL_MEM_FENCE);

    // only the first work item accumulates the result
    if (index == 0) {
        int sum = 0;
        for (int i = 0; i < WORKGROUP_SIZE; i++) {
            sum += buffer[i];
        }
        dst[0] = sum;
    }
}

Reference resources

  • OpenCL Heterogeneous Parallel Computing: Principle, Mechanism and Optimizing Practice
  • OpenCL Reference Pages

Posted by podarum on Thu, 11 Jul 2019 15:55:20 -0700