Cuda demosaicing running slowly

mbeckett · November 9, 2011, 10:27pm

I’m trying to convert a raw Bayer camera image to RGB using Cuda 4 on a 16core GTS250

Based on the SDK sample code in “bilinear” Blocksize is 16x16, image is a multiple of 16 pixels.

But it’s taking 10ms to run, longer than the similar code takes single-threaded on the CPU!

Am I doing something obviously stupid?

texture<uchar, 2, cudaReadModeElementType> tex;

cudaArray *d_imageArray = 0;

__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)

{

    uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

    uint i = __umul24(y, width) + x;

// input is GR/BG output is BGRA

    if ((x < width) && (y < height)) {

if ( y & 0x01 ) {

            if ( x & 0x01 ) {  

                d_output[i].x =  (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2;  // B                

                d_output[i].y = (tex2D(tex,x,y));     // G in B

                d_output[i].z = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2;  // R                    

            } else {

                d_output[i].x = (tex2D(tex,x,y));        //B

                d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4;  // G

                d_output[i].z = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4;   // R

            }

        } else {

            if ( x & 0x01 ) {

                 // odd col = R

                d_output[i].y = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4;  // B

                d_output[i].z = (tex2D(tex,x,y));        //R

                d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4;  // G    

            } else {    

                d_output[i].x = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2;  // B

                d_output[i].y = (tex2D(tex,x,y));               // G  in R               

                d_output[i].z = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2;  // R                    

            }

        }                                

    }

}

void initTexture(int imageWidth, int imageHeight, uchar *imagedata)

{

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);

    cutilSafeCall( cudaMallocArray(&d_imageArray, &channelDesc, imageWidth, imageHeight) ); 

    uint size = imageWidth * imageHeight * sizeof(uchar);

    cutilSafeCall( cudaMemcpyToArray(d_imageArray, 0, 0, imagedata, size, cudaMemcpyHostToDevice) );

    cutFree(imagedata);

// bind array to texture reference with point sampling

    tex.addressMode[0] = cudaAddressModeClamp;

    tex.addressMode[1] = cudaAddressModeClamp;

    tex.filterMode = cudaFilterModePoint;

    tex.normalized = false; 

cutilSafeCall( cudaBindTextureToArray(tex, d_imageArray) );

}

tera · November 10, 2011, 12:54pm

The [font=“Courier New”]if[/font] conditions in this code totally kill performance as at any time only a quarter of the threads will be active at any time. Avoid this by calculating four neighboring pixels per thread.

I’d guess a web search will bring up more optimized codes.

mbeckett · November 10, 2011, 4:40pm

Is that necessarily true? Each input pixel is processed at one point in the loop - it’s just determining which 1 of 4 in the pattern we are. Or do 'if’s automatically spawn a thread?

I wasn’t sure how to process 4 pixels per thread - how do I arrange that the kernel is only called for every 1/4 input pixels?

Do I simply skip 3 out of 4 threadIdx or is there a way of using the blocksize/gridsize to arrange this?

Surprisingly not!

There are some examples of very advanced demosaicing codes which would be completely impractical without CUDA. I would have thought there was a clever trick using texture hardware to essentially rescale the 1 in 4 red pixels to the entire image size but I can’t work it out.

But generally it’s not worth going to the GPU for a simple case, SSE2 is enough - it’s just that this is the last step of a series of operations.

tera · November 10, 2011, 7:07pm

Don’t do this, as it will lead to exactly the same problem: The skipped threads leave holes in your block of threads, which occupy resources but produce no results.

Assuming even pixel dimensions, you could call this kernel with a grid of only half the x and y size:

__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)

{

    uint x = 2 * (__umul24(blockIdx.x, blockDim.x) + threadIdx.x);

    uint y = 2 * (__umul24(blockIdx.y, blockDim.y) + threadIdx.y);

    uint i = __umul24(y, width) + x;

// input is GR/BG output is BGRA

    if ((x < width-1) && (y < height-1)) {

        // x+1, y+1:

        d_output[i+width+1].x = make_uchar4( (tex2D(tex,x+2,y+1)+tex2D(tex,x,y+1))/2,  // B                

                                             (tex2D(tex,x+1,y+1)),     // G in B

                                             (tex2D(tex,x+1,y+2)+tex2D(tex,x+1,y))/2,  // R                    

                                             0);

        // x, y+1:

        d_output[i+width].x =   make_uchar4( (tex2D(tex,x,y+1)),        //B

                                             (tex2D(tex,x+1,y+1) + tex2D(tex,x-1,y+1)+tex2D(tex,x,y+2)+tex2D(tex,x,y))/4,  // G

                                             (tex2D(tex,x+1,y+2) + tex2D(tex,x+1,y)+tex2D(tex,x-1,y+2)+tex2D(tex,x-1,y))/4,   // R

                                             0);

        // x+1, y:

        d_output[i+1].y =       make_uchar4( (tex2D(tex,x+2,y+1) + tex2D(tex,x+2,y-1)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4;  // B

                                             (tex2D(tex,x+1,y)),        //R

                                             (tex2D(tex,x+2,y) + tex2D(tex,x,y)+tex2D(tex,x+1,y+1)+tex2D(tex,x+1,y-1))/4,  // G

                                             0};

        // x, y:

        d_output[i].x =         make_uchar4( (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2,  // B

                                             (tex2D(tex,x,y)),               // G  in R               

                                             (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2,  // R                    

                                             ,0};

    }

}

To optimize the bandwidth further, I’ve assumed that the unused .w channel does not need to be preserved.

OK, maybe it’s not an interesting enough optimization problem, as it is entirely memory bandwidth bound and the above code should take you “almost there”. And as a bandwidth bound problem with a “read once, write once” characteristic it’s only worth doing on the GPU if the data is already there.

That would be easy to achieve if the colors were stored in separate textures. ~~Otherwise I don’t see how the gaps between same-color pixels would allow this trick~~ (see post below).

Maybe you could rearrange the output of the previous step to produce this format. Then again, if you were to change the previous step, integrating the Bayer demosaicing into that would entirely eliminate the bandwidth need of the demosaicing.

Yes. If the data weren’t already on the GPU, it would be impossible to make up for the time spent on piping the data through PCIe.

mbeckett · November 10, 2011, 8:25pm

Doh - yes of course it is as simple as that!
I had the gridsize generated from the image size as a side effect of reading the raw data so it was also used to allocate the textures, hadn’t occurred to me that the grid doesn’t need to be the texture_dimensions/blocksize.

Ironically the original code wasn’t that slow - the cuda timing function returns millisecs on Windows not seconds.
The 4-in-1 replacement is exactly the same speed, but much neater.

Thank you

tera · November 10, 2011, 11:46pm

I don’t see how to solve that one, but you can let the texture unit do the average of two horizontal or vertical pixels by using floating point coordinates halfway between two pixels. To bridge the vertical gap, you can just redefine the texture with half as many rows and twice as many columns. Bridging the horizontal gap requires using uchar2 as the texture type.

Given that this kernel is (texture) bandwidth limited and has plenty of spare arithmetic capacity, about the same speed can probably achieved by just reusing previous values from texture interpolations with identical coordinates in the above kernel.

mbeckett · November 15, 2011, 11:07pm

Since the red (or blue) pixels in the Bayer mask are essentially a 1/2 resolution version of the image I thought you could use textures to generate a full size image using the hardware bilinear texture unit.

But this would mean copying all the reds into a contiguous array and then re-shuffling the resulting red/green/blue planes back into a single RBG(A) image - which would take longer than the CUDA code above.

I just wondered if there was a clever trick using some stepping parameter to do this in one pass - that everyone else knew about except me!

tera · November 16, 2011, 2:49am

That is what I was talking about in my previous post. Instead of reshuffling the data, you use the tricks mentioned above to get the texture unit to interpolate between pixels that are not adjacent.

I do not expect a speedup from this though, as arithmetic operations are not a limiting factor in the above kernel (memory bandwidth is).

More is probably gained from aggregating the two writes to horizontally adjacent pixels into one write of a 64bit data type.

Topic		Replies	Views
demosaicing raw image data nearest neighbor algorithm CUDA Programming and Performance	10	13368	May 2, 2010
Faster method of shifting texture data CUDA Programming and Performance cuda	11	488	July 31, 2023
Re_arranging Cuda array CUDA Programming and Performance	8	49	September 23, 2024
Malvar-He-Cutler Demosaicing Code and optimization advice? CUDA Programming and Performance	11	6065	January 23, 2011
How would you do this? CUDA Programming and Performance	12	4467	August 5, 2008
Low performance and high CPU usage CUDA Programming and Performance	13	19458	August 7, 2007
Kernel for pixel format conversions CUDA Programming and Performance cuda	6	834	October 16, 2020
CUDA performance How to improve my code? CUDA Programming and Performance	16	13574	June 29, 2010
Texture memory fetch extremely slow CUDA Programming and Performance	13	3125	December 21, 2017
horizontal lines after RGBtoRGBA and RGBAtoRGB in place algorigthm CUDA Programming and Performance	5	16759	December 23, 2010

Cuda demosaicing running slowly

Related topics