I tried to optimize the code from the previous post. The gain turned out to be small, but the technique is still worth writing down.
The idea is to move the domain (spatial) term of the bilateral filter from the previous post onto the CPU. That term depends only on the offset inside the window, so it is identical for every work-item. Computing it once on the host and letting the kernel read it should shave a little more time off the kernel.
My original plan was to copy this table to device (video) memory and from there into registers. Register reads are very fast, so the time spent computing would become time spent reading. (The kernel below actually reads the table through a __constant pointer.) Whether a read really costs less than the computation is a fair question, but for more complex calculations I would expect reading to win. For a table this small, computing it on the CPU should also be faster than on the GPU. The amount of data matters too, because copying from host memory to device memory takes time.
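To put rough numbers on that trade-off, the sketch below counts how many domain-term evaluations the kernel would otherwise repeat and how many bytes the lookup table adds to the transfer. The helper names are hypothetical, and the 1920x1080 image size in the note after the code is an illustrative assumption, not the image used for the measurements in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for a ksize x ksize table of float exponents.
std::size_t domainTableBytes(int ksize) {
    assert(ksize > 0 && ksize % 2 == 1);  // odd window, so it has a center tap
    return static_cast<std::size_t>(ksize) * ksize * sizeof(float);
}

// Domain-term evaluations when the host computes the table once.
std::uint64_t domainEvalsOnCpu(int ksize) {
    assert(ksize > 0);
    return static_cast<std::uint64_t>(ksize) * ksize;
}

// Domain-term evaluations when every work-item recomputes the term
// for each tap of its own window.
std::uint64_t domainEvalsOnGpu(int width, int height, int ksize) {
    assert(width > 0 && height > 0 && ksize > 0);
    return static_cast<std::uint64_t>(width) * height * ksize * ksize;
}
```

With ksize = 11 the table holds 121 floats (484 bytes with 4-byte floats), so the extra transfer is tiny. For a hypothetical 1920x1080 image, domainEvalsOnGpu(1920, 1080, 11) gives 250,905,600 evaluations that the table replaces with reads. Each evaluation is only a few multiplies and a divide, which is consistent with the gain being small.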
1. C++ code
..................
int ksize = 11;
float sigma_d = 3.0f;

// Precompute the domain exponent -(i^2 + j^2) / (2 * sigma_d^2) once on the host.
float *dkl = new float[ksize * ksize];
for (int i = -ksize / 2; i <= ksize / 2; i++) {
    for (int j = -ksize / 2; j <= ksize / 2; j++) {
        dkl[(i + ksize / 2) * ksize + (j + ksize / 2)] = -(i * i + j * j) / (2 * sigma_d * sigma_d);
    }
}

// Allocate a read-only device buffer and copy the table into it (blocking write).
cl_mem d_dkl;
d_dkl = clCreateBuffer(context, CL_MEM_READ_ONLY, ksize * ksize * sizeof(float), NULL, NULL);
clEnqueueWriteBuffer(commandQueue, d_dkl, CL_TRUE, 0, ksize * ksize * sizeof(float), dkl, 0, NULL, NULL);
........................
errNum |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_dkl);
errNum |= clSetKernelArg(kernel, 3, sizeof(int), &ksize);  // ksize is an int, so pass sizeof(int), not sizeof(cl_mem)
........................
delete[] dkl;
...................
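The key calls here are clCreateBuffer, which allocates the read-only device buffer, and clEnqueueWriteBuffer, which copies the table into it. Passing CL_TRUE makes the write blocking, so the host array can be freed safely afterwards.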
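The host-side table could also be built in a std::vector, so there is no manual delete[] to forget. This is a minimal sketch, not the code used for the measurements. The function name buildDomainExponents is hypothetical, and the table layout is assumed to match the row-by-row order used above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the ksize x ksize table of domain exponents
// -(i^2 + j^2) / (2 * sigma_d^2), stored row by row in the same order
// in which nested i/j loops from -ksize/2 to ksize/2 visit the window.
std::vector<float> buildDomainExponents(int ksize, float sigma_d) {
    assert(ksize > 0 && ksize % 2 == 1);  // odd window, so it has a center tap
    assert(sigma_d > 0.0f);
    const int half = ksize / 2;
    std::vector<float> table(static_cast<std::size_t>(ksize) * ksize);
    std::size_t index = 0;
    for (int i = -half; i <= half; ++i) {
        for (int j = -half; j <= half; ++j) {
            table[index++] = -static_cast<float>(i * i + j * j)
                             / (2.0f * sigma_d * sigma_d);
        }
    }
    return table;
}
```

The vector's data() pointer and size() * sizeof(float) can then be passed to clEnqueueWriteBuffer in place of the raw pointer and the hand-written byte count.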
2. Kernel code
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

kernel void bilateralBlur(read_only image2d_t src, write_only image2d_t dst,
                          __constant float* dkl, int ksize)
{
    int x = (int)get_global_id(0);
    int y = (int)get_global_id(1);
    if (x >= get_image_width(src) || y >= get_image_height(src))
        return;

    float sigma_r = 0.1f;
    float alpha = 0.2f;
    float4 fij = read_imagef(src, sampler, (int2)(x, y));
    float4 fkl;
    float4 rkl;
    float4 wkl;
    int index = 0;  // walks dkl in the same row-by-row order the host filled it

    float4 numerator = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    // The denominator starts at 1.0f, so the sum is normalized by (1 + sum of weights).
    float4 denominator = (float4)(1.0f, 1.0f, 1.0f, 1.0f);

    for (int K = -ksize / 2; K <= ksize / 2; K++) {
        for (int L = -ksize / 2; L <= ksize / 2; L++) {
            fkl = read_imagef(src, sampler, (int2)(x + K, y + L));

            // Range exponent: still computed per tap because it depends on pixel values.
            rkl.x = -(fij.x - fkl.x) * (fij.x - fkl.x) / (2.0f * sigma_r * sigma_r);
            rkl.y = -(fij.y - fkl.y) * (fij.y - fkl.y) / (2.0f * sigma_r * sigma_r);
            rkl.z = -(fij.z - fkl.z) * (fij.z - fkl.z) / (2.0f * sigma_r * sigma_r);

            // dkl already holds the negative domain exponent, so it is added as is.
            // Negating it again would make the weight grow with distance.
            wkl.x = exp(dkl[index] + rkl.x);
            wkl.y = exp(dkl[index] + rkl.y);
            wkl.z = exp(dkl[index] + rkl.z);
            index++;

            numerator.x += fkl.x * wkl.x;
            numerator.y += fkl.y * wkl.y;
            numerator.z += fkl.z * wkl.z;
            denominator.x += wkl.x;
            denominator.y += wkl.y;
            denominator.z += wkl.z;
        }
    }

    float4 gij = (float4)(0.0f, 0.0f, 0.0f, 1.0f);
    if (denominator.x > 0 && denominator.y > 0 && denominator.z > 0) {
        gij.x = numerator.x / denominator.x;
        gij.y = numerator.y / denominator.y;
        gij.z = numerator.z / denominator.z;
        // Blend the filtered value with the original pixel.
        gij.x = fij.x * alpha + gij.x * (1.0f - alpha);
        gij.y = fij.y * alpha + gij.y * (1.0f - alpha);
        gij.z = fij.z * alpha + gij.z * (1.0f - alpha);
    }
    write_imagef(dst, (int2)(x, y), gij);
}
Compared with the kernel in the previous post, the main change is that dkl is now read from the table instead of being computed, and ksize is passed in as a kernel argument.
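A CPU reference for a single output pixel can help check that the table is indexed in the order the host filled it. The sketch below makes several assumptions. It handles a single-channel float image rather than an RGBA image object, and it reproduces clamp-to-edge addressing by hand. It adds the stored exponent as is, that is exp(dkl + rkl), since the table already holds negative values. It starts the denominator at 1.0 as the kernel does, and takes sigma_r = 0.1 and alpha = 0.2 from the kernel. The function name bilateralPixelRef is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference value for output pixel (x, y) of a single-channel image,
// using a precomputed ksize x ksize table of domain exponents (dkl).
float bilateralPixelRef(const std::vector<float>& src, int width, int height,
                        int x, int y,
                        const std::vector<float>& dkl, int ksize,
                        float sigma_r = 0.1f, float alpha = 0.2f) {
    assert(ksize > 0 && ksize % 2 == 1);
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(dkl.size() == static_cast<std::size_t>(ksize) * ksize);
    assert(x >= 0 && x < width && y >= 0 && y < height);

    const int half = ksize / 2;
    const float fij = src[static_cast<std::size_t>(y) * width + x];
    float numerator = 0.0f;
    float denominator = 1.0f;  // same starting value as the kernel
    std::size_t index = 0;     // walks the table row by row, like the kernel

    for (int K = -half; K <= half; ++K) {
        for (int L = -half; L <= half; ++L) {
            // Clamp coordinates to the image, like CLK_ADDRESS_CLAMP_TO_EDGE.
            const int sx = std::min(std::max(x + K, 0), width - 1);
            const int sy = std::min(std::max(y + L, 0), height - 1);
            const float fkl = src[static_cast<std::size_t>(sy) * width + sx];

            const float diff = fij - fkl;
            const float rkl = -(diff * diff) / (2.0f * sigma_r * sigma_r);
            const float wkl = std::exp(dkl[index++] + rkl);

            numerator += fkl * wkl;
            denominator += wkl;
        }
    }
    const float gij = numerator / denominator;
    return fij * alpha + gij * (1.0f - alpha);  // blend with the original pixel
}
```

Comparing a handful of pixels from the device result against this function, within a small tolerance, should be enough to catch an index or sign mismatch between the host table and the kernel.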
3. Results
Compared with the previous 3.42 ms, the kernel time improved only slightly. Once the CPU-side calculation is taken into account, the net gain is smaller still: it may be zero, or the result may even be slightly worse.
The calculation moved here is simple, so there was little to gain. For more complex calculations this optimization is still worth considering.
The next step is to look at memory optimization and at increasing the work granularity.
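To see how much the CPU side actually costs, the table construction can be timed on its own. The sketch below uses std::chrono for that. The function name averageTableBuildMicros and the repeat count are assumptions for illustration, and it measures only the host loop. The transfer and kernel times would have to come from OpenCL event profiling (a queue created with CL_QUEUE_PROFILING_ENABLE and clGetEventProfilingInfo), which is not shown here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Average host time, in microseconds, to fill one ksize x ksize table
// of domain exponents, measured over `repeats` runs.
double averageTableBuildMicros(int ksize, float sigma_d, int repeats) {
    assert(ksize > 0 && ksize % 2 == 1);
    assert(sigma_d > 0.0f && repeats > 0);
    const int half = ksize / 2;
    std::vector<float> table(static_cast<std::size_t>(ksize) * ksize);
    volatile float sink = 0.0f;  // keeps the compiler from discarding the loop

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        std::size_t index = 0;
        for (int i = -half; i <= half; ++i) {
            for (int j = -half; j <= half; ++j) {
                table[index++] = -static_cast<float>(i * i + j * j)
                                 / (2.0f * sigma_d * sigma_d);
            }
        }
        sink = sink + table[0];
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

Calling averageTableBuildMicros(11, 3.0f, 1000) gives a per-table figure that can be set against the kernel-time difference. The number depends on the machine and compiler flags, so it is only meaningful when measured on the same system as the kernel.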