Hi, I am writing a very simple kernel that does Bayer interpolation. I cannot get this kernel to run any faster than 2000 ticks in the CUDA profiler report, but I believe it could be much quicker; I clearly don't understand which parts of it are slow.
Basically, the kernel takes a float array (global memory or a texture; it didn't seem to make much speed difference) where each pixel is a single color measurement laid out in the Bayer pattern. Each pixel in the output array gets 3 values, one for each of red/green/blue, averaged from all measurements of that color in a 3x3 window around the pixel.
The naive approach is to have the kernel operate on each pixel independently, searching the 3x3 window with if statements to determine which elements to add together. This gives me a score of 2300 in the profiler.
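To make the per-pixel logic concrete, here is a CPU sketch of the naive approach, assuming an RGGB layout (even row + even column = red, odd + odd = blue, the rest green — my actual sensor layout may differ), with border pixels skipped for brevity; the kernel is the same logic with pos derived from blockIdx/threadIdx:

```cpp
#include <cassert>
#include <vector>

// CPU sketch of the naive per-pixel Bayer averaging (RGGB assumed).
struct RGB { float r, g, b; };

void bayerNaive(const std::vector<float>& src, std::vector<RGB>& dst,
                int w, int h)
{
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            int p = y * w + x;
            float c = src[p];
            // 4 axis-aligned neighbors (cross) and 4 diagonal neighbors
            float cross = src[p - 1] + src[p + 1] + src[p - w] + src[p + w];
            float diag  = src[p - w - 1] + src[p - w + 1]
                        + src[p + w - 1] + src[p + w + 1];
            bool evenRow = (y % 2) == 0, evenCol = (x % 2) == 0;
            if (evenRow && evenCol) {            // red center
                dst[p] = { c, cross / 4.0f, diag / 4.0f };
            } else if (!evenRow && !evenCol) {   // blue center
                dst[p] = { diag / 4.0f, cross / 4.0f, c };
            } else if (evenRow) {                // green on a red row
                dst[p] = { (src[p - 1] + src[p + 1]) / 2.0f, c,
                           (src[p - w] + src[p + w]) / 2.0f };
            } else {                             // green on a blue row
                dst[p] = { (src[p - w] + src[p + w]) / 2.0f, c,
                           (src[p - 1] + src[p + 1]) / 2.0f };
            }
        }
    }
}
```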
My next step was to load each block into shared memory, since neighboring pixels read the same values, so I expected this to be faster. It only reduced the score to 2000. What am I doing here that is slow? This feels like it should be a very simple job for the GPU.
Here is my pseudo code:
#define BLOCKDIM 16
dim3 threads(BLOCKDIM, BLOCKDIM);
dim3 grid(iDivUp(size.x, BLOCKDIM), iDivUp(size.y, BLOCKDIM));
bayerKernel<<<grid, threads>>>(dst, src, size);
CUDA_SAFE_CALL( cudaThreadSynchronize() );
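(iDivUp here is just the round-up integer division from the SDK samples, so the grid still covers images whose dimensions are not a multiple of the block size:)

```cpp
#include <cassert>

// Round-up integer division: smallest n such that n * b >= a.
int iDivUp(int a, int b) { return (a + b - 1) / b; }
```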
__shared__ float LocalBlock[(BLOCKDIM+2)*(BLOCKDIM+2)]; // +2 for the 1-pixel apron around the block that also needs loading

each pixel loads src[pos] into LocalBlock
pixels on the block border also load their neighbors outside the block into LocalBlock
__syncthreads();

if in a red center
    dst[pos].r = src[pos]
    dst[pos].g = sum of 4 green neighbors / 4
    dst[pos].b = sum of 4 blue neighbors / 4
else if in a blue center
    dst[pos].r = sum of 4 red neighbors / 4
    dst[pos].g = sum of 4 green neighbors / 4
    dst[pos].b = src[pos]
else if in a green center
    dst[pos].r = sum of 2 red neighbors / 2
    dst[pos].g = src[pos]
    dst[pos].b = sum of 2 blue neighbors / 2
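The apron load is the part I'm least sure about. Here is a CPU sketch of the cooperative load pattern I'm describing, where the flat thread index sweeps the (BLOCKDIM+2)^2 tile in two passes and source coordinates are clamped at the image border (the clamping policy is my assumption; the helper name loadTile is made up for illustration):

```cpp
#include <cassert>
#include <vector>

const int BLOCKDIM = 16;
const int TILEDIM  = BLOCKDIM + 2;   // 18x18 tile incl. 1-pixel apron

// Simulates one block's cooperative load: each of the 256 "threads"
// copies up to two tile elements, so all 18*18 = 324 slots get filled.
void loadTile(const std::vector<float>& src, int w, int h,
              int blockX, int blockY, std::vector<float>& tile)
{
    for (int tid = 0; tid < BLOCKDIM * BLOCKDIM; ++tid) {
        for (int idx = tid; idx < TILEDIM * TILEDIM;
             idx += BLOCKDIM * BLOCKDIM) {
            int tx = idx % TILEDIM, ty = idx / TILEDIM;
            // tile (0,0) maps to image (blockX*16 - 1, blockY*16 - 1)
            int ix = blockX * BLOCKDIM + tx - 1;
            int iy = blockY * BLOCKDIM + ty - 1;
            ix = ix < 0 ? 0 : (ix >= w ? w - 1 : ix);   // clamp at borders
            iy = iy < 0 ? 0 : (iy >= h ? h - 1 : iy);
            tile[idx] = src[iy * w + ix];
        }
    }
}
```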
Any help would be appreciated. Thanks!