CUDA demosaicing running slowly

I’m trying to convert a raw Bayer camera image to RGB using CUDA 4 on a 16-core GTS 250.

Based on the SDK sample code in “bilinear”. Block size is 16x16, and the image is a multiple of 16 pixels.

But it’s taking 10 ms to run - longer than similar code takes single-threaded on the CPU!

Am I doing something obviously stupid?

texture<uchar, 2, cudaReadModeElementType> tex;

cudaArray *d_imageArray = 0;

__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)
{
    uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    uint i = __umul24(y, width) + x;

    // input is GR/BG, output is BGRA
    if ((x < width) && (y < height)) {
        if ( y & 0x01 ) {
            if ( x & 0x01 ) {
                // odd row, odd col = G in B row
                d_output[i].x = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2;  // B
                d_output[i].y =  tex2D(tex,x,y);                        // G in B
                d_output[i].z = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2;  // R
            } else {
                // odd row, even col = B
                d_output[i].x =  tex2D(tex,x,y);                        // B
                d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4;          // G
                d_output[i].z = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4;  // R
            }
        } else {
            if ( x & 0x01 ) {
                // even row, odd col = R
                d_output[i].x = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4;  // B
                d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4;          // G
                d_output[i].z =  tex2D(tex,x,y);                        // R
            } else {
                // even row, even col = G in R row
                d_output[i].x = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2;  // B
                d_output[i].y =  tex2D(tex,x,y);                        // G in R
                d_output[i].z = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2;  // R
            }
        }
    }
}

void initTexture(int imageWidth, int imageHeight, uchar *imagedata)
{
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cutilSafeCall( cudaMallocArray(&d_imageArray, &channelDesc, imageWidth, imageHeight) );

    uint size = imageWidth * imageHeight * sizeof(uchar);
    cutilSafeCall( cudaMemcpyToArray(d_imageArray, 0, 0, imagedata, size, cudaMemcpyHostToDevice) );

    // bind array to texture reference with point sampling
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false;

    cutilSafeCall( cudaBindTextureToArray(tex, d_imageArray) );
}

The if conditions in this code totally kill performance, as only a quarter of the threads will be active at any time. Avoid this by calculating four neighboring pixels per thread.

I’d guess a web search will bring up more optimized codes.

Is that necessarily true? Each input pixel is processed exactly once - it’s just a matter of determining which of the 4 positions in the pattern we are at. Or do ‘if’s automatically spawn a thread?

I wasn’t sure how to process 4 pixels per thread - how do I arrange that the kernel is only called for every fourth input pixel?

Do I simply skip 3 out of 4 values of threadIdx, or is there a way of using the block size/grid size to arrange this?

Surprisingly not!

There are some examples of very advanced demosaicing codes which would be completely impractical without CUDA. I would have thought there was a clever trick using texture hardware to essentially rescale the 1 in 4 red pixels to the entire image size but I can’t work it out.

But generally it’s not worth going to the GPU for a simple case like this; SSE2 is enough. It’s just that this is the last step of a series of operations.

Don’t do this, as it will lead to exactly the same problem: The skipped threads leave holes in your block of threads, which occupy resources but produce no results.

Assuming even pixel dimensions, you could call this kernel with a grid of only half the x and y size:

__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)
{
    uint x = 2 * (__umul24(blockIdx.x, blockDim.x) + threadIdx.x);
    uint y = 2 * (__umul24(blockIdx.y, blockDim.y) + threadIdx.y);
    uint i = __umul24(y, width) + x;

    // input is GR/BG, output is BGRA
    if ((x < width-1) && (y < height-1)) {
        // x+1, y+1: G in B row
        d_output[i+width+1] = make_uchar4( (tex2D(tex,x+2,y+1)+tex2D(tex,x,y+1))/2,    // B
                                            tex2D(tex,x+1,y+1),                        // G in B
                                           (tex2D(tex,x+1,y+2)+tex2D(tex,x+1,y))/2,    // R
                                           0 );

        // x, y+1: B
        d_output[i+width]   = make_uchar4(  tex2D(tex,x,y+1),                          // B
                                           (tex2D(tex,x+1,y+1) + tex2D(tex,x-1,y+1)+tex2D(tex,x,y+2)+tex2D(tex,x,y))/4,            // G
                                           (tex2D(tex,x+1,y+2) + tex2D(tex,x+1,y)+tex2D(tex,x-1,y+2)+tex2D(tex,x-1,y))/4,          // R
                                           0 );

        // x+1, y: R
        d_output[i+1]       = make_uchar4( (tex2D(tex,x+2,y+1) + tex2D(tex,x+2,y-1)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4,          // B
                                           (tex2D(tex,x+2,y) + tex2D(tex,x,y)+tex2D(tex,x+1,y+1)+tex2D(tex,x+1,y-1))/4,            // G
                                            tex2D(tex,x+1,y),                          // R
                                           0 );

        // x, y: G in R row
        d_output[i]         = make_uchar4( (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2,      // B
                                            tex2D(tex,x,y),                            // G in R
                                           (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2,      // R
                                           0 );
    }
}
To optimize the bandwidth further, I’ve assumed that the unused .w channel does not need to be preserved.
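For completeness, the host-side launch arithmetic for the half-size grid could be sketched like this in plain C (the BLOCK constant and the grid_dim name are mine, not from the SDK sample):

    #include <assert.h>

    /* Sketch: compute the launch grid for a kernel that handles a 2x2
     * pixel quad per thread. Only half the pixel dimensions need to be
     * covered, rounded up to whole 16x16 blocks. */
    #define BLOCK 16

    static unsigned int grid_dim(unsigned int pixels)
    {
        unsigned int quads = pixels / 2;        /* one thread per 2 pixels  */
        return (quads + BLOCK - 1) / BLOCK;     /* round up to full blocks  */
    }

    int main(void)
    {
        /* A 640x480 image: 320x240 quads -> a 20x15 grid of 16x16 blocks. */
        assert(grid_dim(640) == 20);
        assert(grid_dim(480) == 15);
        return 0;
    }

The same values would then go into the dim3 grid argument of the kernel launch.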

OK, maybe it’s not an interesting enough optimization problem, as it is entirely memory bandwidth bound and the above code should take you “almost there”. And as a bandwidth bound problem with a “read once, write once” characteristic it’s only worth doing on the GPU if the data is already there.

That would be easy to achieve if the colors were stored in separate textures. Otherwise I don’t see how the gaps between same-color pixels would allow this trick (see post below).

Maybe you could rearrange the output of the previous step to produce this format. Then again, if you were to change the previous step, integrating the Bayer demosaicing into that would entirely eliminate the bandwidth need of the demosaicing.

Yes. If the data weren’t already on the GPU, it would be impossible to make up for the time spent on piping the data through PCIe.

Doh - yes of course it is as simple as that!
I had the grid size generated from the image size as a side effect of reading the raw data (it was also used to allocate the textures), so it hadn’t occurred to me that the grid doesn’t need to be texture_dimensions/blocksize.

Ironically, the original code wasn’t that slow after all - the CUDA timing function returns milliseconds on Windows, not seconds.
The 4-in-1 replacement is exactly the same speed, but much neater.

Thank you

I don’t see how to solve that one, but you can let the texture unit do the average of two horizontal or vertical pixels by using floating point coordinates halfway between two pixels. To bridge the vertical gap, you can just redefine the texture with half as many rows and twice as many columns. Bridging the horizontal gap requires using uchar2 as the texture type.
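To make that trick concrete, here is a plain-C model of what 1-D linear texture filtering computes (my own illustration, ignoring CUDA’s half-texel coordinate offset and its limited-precision interpolation weights): sampling halfway between two texels returns their average, which is exactly the 2-tap interpolation the demosaic needs.

    #include <assert.h>

    /* Sketch: 1-D model of linear texture filtering, in array-index
     * terms. Sampling at fractional coordinate x blends the two nearest
     * texels; at x = i + 0.5 the result is the average of texels i and
     * i+1 - one "free" 2-tap filter per fetch. */
    static float linear_sample(const unsigned char *texels, float x)
    {
        int   i = (int)x;           /* left neighbour     */
        float f = x - (float)i;     /* fractional weight  */
        return (1.0f - f) * texels[i] + f * texels[i + 1];
    }

    int main(void)
    {
        unsigned char row[4] = { 10, 30, 50, 70 };
        assert(linear_sample(row, 0.5f) == 20.0f);  /* (10 + 30) / 2 */
        assert(linear_sample(row, 2.0f) == 50.0f);  /* exactly on a texel */
        return 0;
    }

Note that real CUDA hardware filtering uses 9-bit fixed-point weights, so the averages are slightly quantized compared with this model.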

Given that this kernel is (texture) bandwidth limited and has plenty of spare arithmetic capacity, about the same speed can probably be achieved by just reusing previous values from texture interpolations with identical coordinates in the above kernel.

Since the red (or blue) pixels in the Bayer mask are essentially a 1/2 resolution version of the image I thought you could use textures to generate a full size image using the hardware bilinear texture unit.

But this would mean copying all the reds into a contiguous array and then re-shuffling the resulting red/green/blue planes back into a single RGB(A) image - which would take longer than the CUDA code above.

I just wondered if there was a clever trick using some stepping parameter to do this in one pass - that everyone else knew about except me!

That is what I was talking about in my previous post. Instead of reshuffling the data, you use the tricks mentioned above to get the texture unit to interpolate between pixels that are not adjacent.

I do not expect a speedup from this though, as arithmetic operations are not a limiting factor in the above kernel (memory bandwidth is).

More is probably gained from aggregating the two writes to horizontally adjacent pixels into one write of a 64bit data type.
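A plain-C sketch of that aggregation (names mine): two 4-byte BGRA pixels packed into one 64-bit value, so in the kernel a pair of horizontally adjacent outputs could be written with a single uint2/unsigned long long store instead of two uchar4 stores.

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch: pack two 4-byte BGRA pixels into one 64-bit value so two
     * adjacent outputs can be written with a single wide store. */
    typedef struct { uint8_t x, y, z, w; } pixel4;

    static uint64_t pack_pair(pixel4 left, pixel4 right)
    {
        uint64_t out;
        uint8_t bytes[8] = { left.x, left.y, left.z, left.w,
                             right.x, right.y, right.z, right.w };
        memcpy(&out, bytes, sizeof out);  /* same byte layout as two stores */
        return out;
    }

    int main(void)
    {
        pixel4 a = { 1, 2, 3, 0 }, b = { 4, 5, 6, 0 };
        uint64_t packed = pack_pair(a, b);
        uint8_t back[8];
        memcpy(back, &packed, sizeof back);
        /* unpacking gives back both pixels byte-for-byte */
        assert(back[0] == 1 && back[4] == 4 && back[6] == 6);
        return 0;
    }

For this to pay off, the 64-bit store must also be naturally aligned, which it is here since the quad starts at an even x.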