q: simple optimization question

Hi, I am writing a very simple kernel that does Bayer interpolation. I cannot get this kernel to run any faster than 2000 ticks in the CUDA profiler report, but I believe it should be much quicker; I suspect I'm missing something about CUDA and about which parts of this are slow.

Basically, the kernel takes a float array (global mem or texture, didn't seem to make much speed difference) where each pixel is a single red, green, or blue measurement, laid out in the Bayer mosaic pattern.

Each pixel in the output array gets 3 values, one for each of red/green/blue, averaged from all measurements of that color in a 3x3 window around the pixel.

The naive approach is to have the kernel operate on each pixel independently, searching the 3x3 window with if statements to determine which elements to add together. This gives me a score of 2300 in the profiler.

My next step was to load local blocks into shared memory, because they are accessed by neighboring pixels, so I thought this would be faster. This only reduced it to 2000. What am I doing that is slow here? I feel like this should be a very simple thing for the GPU to do.

Here is my pseudo code:

calling function:

    const int BLOCKDIM = 16;
    dim3 threads(BLOCKDIM, BLOCKDIM);
    dim3 grid(iDivUp(size.x, BLOCKDIM), iDivUp(size.y, BLOCKDIM));
    bayerKernel<<<grid, threads>>>(dst, src, size);
    CUDA_SAFE_CALL( cudaThreadSynchronize() );


    shared LocalBlock[(BLOCKDIM+2)*(BLOCKDIM+2)] //+2 because of the 1 pixel apron around the block that needs to also be loaded
    each pixel loads src[pos] into LocalBlock
    pixels around the border also load their neighbor outside the block into LocalBlock


    if in a red center
       dst[pos].r = src[pos]
       dst[pos].g = sum of 4 green neighbors / 4
       dst[pos].b = sum of 4 blue neighbors / 4
    else if in a blue center
       dst[pos].r = sum of 4 red neighbors / 4
       dst[pos].g = sum of 4 green neighbors / 4
       dst[pos].b = src[pos]
    else if in a green center
       dst[pos].r = sum of 2 red neighbors / 2
       dst[pos].g = src[pos]
       dst[pos].b = sum of 2 blue neighbors / 2        
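For concreteness, the pseudo code above written as plain host code. This is only a CPU reference, not the kernel: it assumes an RGGB layout with red at (even x, even y), which may be shifted on the actual sensor, it ignores image borders, and `colorAt`/`demosaicAt` are made-up names:

```cpp
#include <vector>

struct RGB { float r, g, b; };
enum Color { RED = 0, GREEN = 1, BLUE = 2 };

// Assumed RGGB layout: red at (even x, even y), blue at (odd x, odd y),
// green elsewhere. A real sensor may use a shifted variant of this.
static Color colorAt(int x, int y) {
    if (y % 2 == 0) return (x % 2 == 0) ? RED : GREEN;
    return (x % 2 == 0) ? GREEN : BLUE;
}

// Demosaic one interior pixel (no border handling): keep the pixel's own
// sample, and average every sample of each *other* color in the 3x3 window.
static RGB demosaicAt(const std::vector<float>& src, int w, int x, int y) {
    Color center = colorAt(x, y);
    float sum[3] = {0, 0, 0};
    int   cnt[3] = {0, 0, 0};
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            Color c = colorAt(x + dx, y + dy);
            if (c != center) { sum[c] += src[(y + dy) * w + x + dx]; ++cnt[c]; }
        }
    float out[3];
    out[center] = src[y * w + x];
    for (int c = 0; c < 3; ++c)
        if (c != center) out[c] = sum[c] / cnt[c];
    return RGB{out[RED], out[GREEN], out[BLUE]};
}
```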

Any help would be appreciated. Thanks!


You probably have a lot of shared memory bank conflicts, and the big if()/else if chain that surrounds your code causes warp divergence.
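To see why char accesses conflict, here's a host-side sketch of the bank arithmetic. It assumes 16 banks of 4 bytes each (the compute 1.x layout this advice targets), and `bankOf`/`banksTouched` are made-up names:

```cpp
#include <set>

// Shared memory is split into banks. Assuming 16 banks, each 4 bytes wide,
// byte offset B lives in bank (B/4) % 16; threads of a half-warp that hit
// the same bank are serialized.
static int bankOf(int byteOffset) { return (byteOffset / 4) % 16; }

// Count the distinct banks touched when 16 consecutive threads each access
// one element of the given size in bytes.
static int banksTouched(int elemBytes) {
    std::set<int> banks;
    for (int tid = 0; tid < 16; ++tid)
        banks.insert(bankOf(tid * elemBytes));
    return (int)banks.size();
}
```

Per-thread char accesses land 16 threads on only 4 banks (a 4-way conflict), while per-thread int accesses spread across all 16.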

A simple approach that won’t cause bank conflicts but will still have divergence:

extern __shared__ char pixels[]; // size passed as the third <<<grid, threads, smemBytes>>> launch argument

... load into smem ... // make sure this is coalesced and without bank conflicts. This is tricky since the char datatype is inherently prone to causing conflicts. Recast your pointers as int* for the purpose of copying, i.e., a thread should copy 4 bytes in a single instruction, not 1.

val = pixel to the upper-left // this is guaranteed to not cause bank-conflicts without any tricks

switch(my kind of pixel)

   red: blue += val/4

   blue: red += val/4

   green: do nothing

val = pixel directly above

switch(my kind of pixel)

   red: green += val/4

   blue: green += val/4

   green: if() red += val/2 else blue += val/2

... etc ...
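As a host-side illustration of the int* recast mentioned above (a sketch only: it assumes n is divisible by 4 and both buffers are 4-byte aligned, and `copyAsWords` is a made-up name):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Copy n bytes one 32-bit word at a time -- the recast-to-int* trick.
// In the kernel, each thread would issue one of these word copies per
// iteration, giving 4-byte transactions instead of 1-byte ones.
static void copyAsWords(char* dst, const char* src, int n) {
    uint32_t*       d = reinterpret_cast<uint32_t*>(dst);
    const uint32_t* s = reinterpret_cast<const uint32_t*>(src);
    for (int i = 0; i < n / 4; ++i)
        d[i] = s[i];  // 4 bytes per access, not 1
}
```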

The fastest approach that won’t cause either kind of conflict would be to process a symmetric block of pixels inside one thread. Ie, since your pattern repeats as a 2x2 block, that’s what you should process inside one thread.

__shared__ char pixelsOfTheBlock[1156];

    ... read (32+2)x(32+2) pixels into shared memory ... // see notes above

/* reading from an in-register array is faster and can't cause any conflicts */

char pixelsOfTheThread[16];

... read (2+2)x(2+2) pixels into the in-register array ... // again, the chars make it tricky to not have bank conflicts, but it's possible

... have your code here. there won't be any if()s or bank conflicts ...
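Here is roughly what that branch-free per-2x2 code would look like, written as host code. A sketch only: it assumes the thread's 2x2 quad sits at offset (1,1) of the 4x4 window, in R G / G B order, and `demosaicQuad` is a made-up name:

```cpp
struct RGB { float r, g, b; };

// p is the thread's 4x4 window (the 2x2 quad at offset (1,1) plus a
// 1-pixel apron), row-major. Because each of the four quad positions has
// a fixed, known color, the averaging is straight-line code with no if()s
// on the pixel kind.
static void demosaicQuad(const float p[16], RGB out[4]) {
    #define P(x, y) p[(y) * 4 + (x)]
    // Red pixel at (1,1): greens in the cross, blues on the diagonals
    out[0] = { P(1,1),
               (P(0,1) + P(2,1) + P(1,0) + P(1,2)) / 4.0f,
               (P(0,0) + P(2,0) + P(0,2) + P(2,2)) / 4.0f };
    // Green pixel at (2,1), red row: reds left/right, blues above/below
    out[1] = { (P(1,1) + P(3,1)) / 2.0f,
               P(2,1),
               (P(2,0) + P(2,2)) / 2.0f };
    // Green pixel at (1,2), blue row: reds above/below, blues left/right
    out[2] = { (P(1,1) + P(1,3)) / 2.0f,
               P(1,2),
               (P(0,2) + P(2,2)) / 2.0f };
    // Blue pixel at (2,2): reds on the diagonals, greens in the cross
    out[3] = { (P(1,1) + P(3,1) + P(1,3) + P(3,3)) / 4.0f,
               (P(2,1) + P(1,2) + P(3,2) + P(2,3)) / 4.0f,
               P(2,2) };
    #undef P
}
```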

When using in-register arrays, add "--ptxas-options=-v" to the nvcc command line to check whether the compiler is spilling the array to slow local memory.