q: simple optimization question

Hi, I am writing a very simple kernel that does Bayer interpolation. I cannot get this kernel to run any faster than 2000 ticks in the CUDA profiler report, but I believe it should be much quicker; I suspect I'm missing something about CUDA and about which parts of this are slow.

Basically, the kernel takes a float array (global mem or texture, didn't seem to make much speed difference) where each pixel is a single red, green, or blue measurement, laid out in the Bayer mosaic pattern.

Each pixel in the output array gets 3 values, one for each of red/green/blue, averaged from all measurements of that color in a 3x3 window around the pixel.

The naive approach is to have the kernel operate on each pixel independently, searching the 3x3 window with if statements to determine which elements to add together. This gives me a score of 2300 in the profiler.

My next step was to load local blocks into shared memory, because they are accessed by neighboring pixels, so I thought this would be faster. This only reduced it to 2000. What am I doing that is slow here? I feel like this should be a very simple thing for the GPU to do.

Here is my pseudo code:

calling function:

    const int BLOCKDIM = 16;
    dim3 threads(BLOCKDIM, BLOCKDIM);
    dim3 grid(iDivUp(size.x, BLOCKDIM), iDivUp(size.y, BLOCKDIM));
    bayerKernel<<<grid, threads>>>(dst, src, size);
    CUDA_SAFE_CALL( cudaThreadSynchronize() );


    shared LocalBlock[(BLOCKDIM+2)*(BLOCKDIM+2)] //+2 because of the 1 pixel apron around the block that needs to also be loaded
    each pixel loads src[pos] into LocalBlock
    pixels around the border also load their neighbor outside the block into LocalBlock


    if in a red center
       dst[pos].r = src[pos]
       dst[pos].g = sum of 4 green neighbors / 4
       dst[pos].b = sum of 4 blue neighbors / 4
    else if in a blue center
       dst[pos].r = sum of 4 red neighbors / 4
       dst[pos].g = sum of 4 green neighbors / 4
       dst[pos].b = src[pos]
    else if in a green center
       dst[pos].r = sum of 2 red neighbors / 2
       dst[pos].g = src[pos]
       dst[pos].b = sum of 2 blue neighbors / 2        
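For concreteness, the pseudo code above written as plain host code. This is only a CPU reference, not the kernel: it assumes an RGGB layout with red at (even x, even y), which may be shifted on the actual sensor, it ignores image borders, and `colorAt`/`demosaicAt` are made-up names:

```cpp
#include <vector>

struct RGB { float r, g, b; };
enum Color { RED = 0, GREEN = 1, BLUE = 2 };

// Assumed RGGB layout: red at (even x, even y), blue at (odd x, odd y),
// green elsewhere. A real sensor may use a shifted variant of this.
static Color colorAt(int x, int y) {
    if (y % 2 == 0) return (x % 2 == 0) ? RED : GREEN;
    return (x % 2 == 0) ? GREEN : BLUE;
}

// Demosaic one interior pixel (no border handling): keep the pixel's own
// sample, and average every sample of each *other* color in the 3x3 window.
static RGB demosaicAt(const std::vector<float>& src, int w, int x, int y) {
    Color center = colorAt(x, y);
    float sum[3] = {0, 0, 0};
    int   cnt[3] = {0, 0, 0};
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            Color c = colorAt(x + dx, y + dy);
            if (c != center) { sum[c] += src[(y + dy) * w + x + dx]; ++cnt[c]; }
        }
    float out[3];
    out[center] = src[y * w + x];
    for (int c = 0; c < 3; ++c)
        if (c != center) out[c] = sum[c] / cnt[c];
    return RGB{out[RED], out[GREEN], out[BLUE]};
}
```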

Any help would be appreciated. Thanks!


You probably have a lot of shared memory bank conflicts, and the big if()/else if chain that surrounds your code causes warp divergence.
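To see why char accesses conflict, here's a host-side sketch of the bank arithmetic. It assumes 16 banks of 4 bytes each (the compute 1.x layout this advice targets), and `bankOf`/`banksTouched` are made-up names:

```cpp
#include <set>

// Shared memory is split into banks. Assuming 16 banks, each 4 bytes wide,
// byte offset B lives in bank (B/4) % 16; threads of a half-warp that hit
// the same bank are serialized.
static int bankOf(int byteOffset) { return (byteOffset / 4) % 16; }

// Count the distinct banks touched when 16 consecutive threads each access
// one element of the given size in bytes.
static int banksTouched(int elemBytes) {
    std::set<int> banks;
    for (int tid = 0; tid < 16; ++tid)
        banks.insert(bankOf(tid * elemBytes));
    return (int)banks.size();
}
```

Per-thread char accesses land 16 threads on only 4 banks (a 4-way conflict), while per-thread int accesses spread across all 16.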

A simple approach that won’t cause bank conflicts but will still have divergence:

extern __shared__ char pixels[]; // size passed as the third <<<grid, threads, smemBytes>>> launch argument

... load into smem ... // make sure this is coalesced and without bank conflicts. This is tricky since the char datatype is inherently prone to causing conflicts. Recast your pointers as int* for the purpose of copying, i.e., a thread should copy 4 bytes in a single instruction, not 1.

val = pixel to the upper-left // this is guaranteed to not cause bank-conflicts without any tricks

switch(my kind of pixel)

   red: blue += val/4

   blue: red += val/4

   green: do nothing

val = pixel directly above

switch(my kind of pixel)

   red: green += val/4

   blue: green += val/4

   green: if() red += val/2 else blue += val/2

... etc ...
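As a host-side illustration of the int* recast mentioned above (a sketch only: it assumes n is divisible by 4 and both buffers are 4-byte aligned, and `copyAsWords` is a made-up name):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Copy n bytes one 32-bit word at a time -- the recast-to-int* trick.
// In the kernel, each thread would issue one of these word copies per
// iteration, giving 4-byte transactions instead of 1-byte ones.
static void copyAsWords(char* dst, const char* src, int n) {
    uint32_t*       d = reinterpret_cast<uint32_t*>(dst);
    const uint32_t* s = reinterpret_cast<const uint32_t*>(src);
    for (int i = 0; i < n / 4; ++i)
        d[i] = s[i];  // 4 bytes per access, not 1
}
```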

The fastest approach that won’t cause either kind of conflict would be to process a symmetric block of pixels inside one thread. Ie, since your pattern repeats as a 2x2 block, that’s what you should process inside one thread.

__shared__ char pixelsOfTheBlock[1156];

    ... read (32+2)x(32+2) pixels into shared memory ... // see notes above

/* reading from an in-register array is faster and can't cause any conflicts */

char pixelsOfTheThread[16];

... read (2+2)x(2+2) pixels into the in-register array ... // again, the chars make it tricky to not have bank conflicts, but it's possible

... have your code here. there won't be any if()s or bank conflicts ...
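Here is roughly what that branch-free per-2x2 code would look like, written as host code. A sketch only: it assumes the thread's 2x2 quad sits at offset (1,1) of the 4x4 window, in R G / G B order, and `demosaicQuad` is a made-up name:

```cpp
struct RGB { float r, g, b; };

// p is the thread's 4x4 window (the 2x2 quad at offset (1,1) plus a
// 1-pixel apron), row-major. Because each of the four quad positions has
// a fixed, known color, the averaging is straight-line code with no if()s
// on the pixel kind.
static void demosaicQuad(const float p[16], RGB out[4]) {
    #define P(x, y) p[(y) * 4 + (x)]
    // Red pixel at (1,1): greens in the cross, blues on the diagonals
    out[0] = { P(1,1),
               (P(0,1) + P(2,1) + P(1,0) + P(1,2)) / 4.0f,
               (P(0,0) + P(2,0) + P(0,2) + P(2,2)) / 4.0f };
    // Green pixel at (2,1), red row: reds left/right, blues above/below
    out[1] = { (P(1,1) + P(3,1)) / 2.0f,
               P(2,1),
               (P(2,0) + P(2,2)) / 2.0f };
    // Green pixel at (1,2), blue row: reds above/below, blues left/right
    out[2] = { (P(1,1) + P(1,3)) / 2.0f,
               P(1,2),
               (P(0,2) + P(2,2)) / 2.0f };
    // Blue pixel at (2,2): reds on the diagonals, greens in the cross
    out[3] = { (P(1,1) + P(3,1) + P(1,3) + P(3,3)) / 4.0f,
               (P(2,1) + P(1,2) + P(3,2) + P(2,3)) / 4.0f,
               P(2,2) };
    #undef P
}
```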

When using in-register arrays, add "--ptxas-options=-v" to the nvcc command line to check whether the compiler is spilling the array to slow local memory.