Demosaicing raw image data with a nearest neighbor algorithm

Hi Forum!

I programmed a simple nearest neighbor algorithm for demosaicing (debayering) raw images. These images are stored in an OpenCV IplImage structure. Unfortunately I used too many if-statements, which makes my CUDA version three times slower than the equivalent CPU version. So I wonder if someone can help me accelerate my algorithm, or maybe someone has already programmed another demosaicing algorithm (e.g. bilinear) in CUDA.

I would be grateful for your help. Thanks for any helpful advice.

Sandra

PS: Enclosed you’ll find my code.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>
#include <cutil_inline.h>
#include <cv.h>

texture<unsigned char, 2, cudaReadModeElementType> imgTex;

// nearest neighbor kernel using texture memory
__global__ void debayer_TM( unsigned char *res, int width, int height )
{
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	if ( x > 1 && y > 1 && x <= width && y <= height )
	{
		if ( x%2==0 )
		{
			if ( y%2==0 )
			{
				res[y*width*3 + x*3]   = tex2D(imgTex, x-1, y-1); //blue
				res[y*width*3 + x*3+1] = tex2D(imgTex, x-1, y);   //green
				res[y*width*3 + x*3+2] = tex2D(imgTex, x, y);     //red
			}
			else
			{
				res[y*width*3 + x*3]   = tex2D(imgTex, x-1, y); //blue
				res[y*width*3 + x*3+1] = tex2D(imgTex, x, y);   //green
				res[y*width*3 + x*3+2] = tex2D(imgTex, x, y-1); //red
			}
		}
		else
		{
			if ( y%2==0 )
			{
				res[y*width*3 + x*3]   = tex2D(imgTex, x, y-1); //blue
				res[y*width*3 + x*3+1] = tex2D(imgTex, x, y);   //green
				res[y*width*3 + x*3+2] = tex2D(imgTex, x-1, y); //red
			}
			else
			{
				res[y*width*3 + x*3]   = tex2D(imgTex, x, y);     //blue
				res[y*width*3 + x*3+1] = tex2D(imgTex, x-1, y);   //green
				res[y*width*3 + x*3+2] = tex2D(imgTex, x-1, y-1); //red
			}
		}
	}
}

extern "C" void CudaDeBayerTM( IplImage *iplIn, IplImage *iplOut )
{
	// declare device pointers
	unsigned char *DEVres;
	cudaArray *imgArray;

	// create channel descriptor for the 2D CUDA array
	cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<unsigned char>();

	// allocate device memory
	int size = sizeof(unsigned char)*iplIn->width*iplIn->height;
	cudaMallocArray(&imgArray, &channelDesc, iplIn->width, iplIn->height);
	cudaMalloc((void**)&DEVres, size*3);

	// copy host to device
	cudaMemcpy2DToArray(imgArray, 0, 0, (unsigned char*) iplIn->imageData,
	                    sizeof(unsigned char) * iplIn->widthStep,
	                    sizeof(unsigned char) * iplIn->width,
	                    iplIn->height, cudaMemcpyHostToDevice);

	// bind the array to the texture
	cudaBindTextureToArray(imgTex, imgArray, channelDesc);

	// launch kernel
	dim3 block(16, 8, 1);
	dim3 grid(iplIn->width/block.x, iplIn->height/block.y, 1);
	debayer_TM <<< grid, block, 0 >>> ( DEVres, iplIn->width, iplIn->height );
	CUDA_SAFE_CALL(cudaThreadSynchronize());

	// copy device to host
	cudaMemcpy(iplOut->imageData, DEVres, iplIn->height*iplIn->width*3, cudaMemcpyDeviceToHost);

	// free device memory
	cudaFreeArray(imgArray);
	cudaUnbindTexture(imgTex);
	cudaFree(DEVres);
}

32-bit integer modulus is a very slow operation on GT200 and earlier cards, so you should see a speed-up if you can come up with another way of evaluating the odd/even condition. You should also get better write performance if you read from texture memory into register variables and then move the global memory writes outside the conditional code, with a synchronization barrier in front of them.

Also, check how many threads per block you are actually keeping busy; launching too few threads wastes a large part of the computational capacity of each multiprocessor. Ideally the number of threads per block should be a multiple of 32.
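For example (just a sketch; the exact block shape is something to experiment with), a 32x8 block gives 256 threads per block, and rounding the grid up keeps images whose size is not a multiple of the block size covered, provided the kernel discards threads with x >= width or y >= height:

dim3 block(32, 8, 1);                               // 256 threads, a multiple of 32
dim3 grid((iplIn->width  + block.x - 1) / block.x,  // round up instead of truncating
          (iplIn->height + block.y - 1) / block.y, 1);
debayer_TM <<< grid, block >>> ( DEVres, iplIn->width, iplIn->height );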

I think the uncoalesced byte writes are a huge factor in killing your performance.

Working with data structures that are multiples of 4 bytes (e.g. BGRA or RGBA) might make it easier to achieve memory coalescing.
That will require copying blocks of image data into shared memory (using coalesced reads) before doing the filtering; then there is no longer any need to use textures for reading the source data. After filtering, use coalesced writes to put the data back into global memory.
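To make that more concrete, here is a rough, untested sketch of the pattern (the kernel name, tile size and the BGRA-as-uchar4 output are my assumptions, not code from this thread): each block stages its part of the Bayer image in shared memory with one extra row and column at the top/left (the nearest neighbor only ever looks at offsets of -1), every thread then assembles one pixel in registers and writes it back as a single 4-byte store.

#define BLOCK_X 32
#define BLOCK_Y 8

__global__ void debayer_SM( const unsigned char *src, uchar4 *dst,
                            int width, int height, int srcPitch )
{
	// tile with a one-pixel apron at the top and left
	__shared__ unsigned char tile[BLOCK_Y + 1][BLOCK_X + 1];

	unsigned int x = blockIdx.x * BLOCK_X + threadIdx.x;
	unsigned int y = blockIdx.y * BLOCK_Y + threadIdx.y;

	// clamp so that out-of-image threads still load something valid
	// and every thread reaches the barrier
	unsigned int xc = (x < width)  ? x : width  - 1;
	unsigned int yc = (y < height) ? y : height - 1;
	unsigned int xm = (xc > 0) ? xc - 1 : 0;
	unsigned int ym = (yc > 0) ? yc - 1 : 0;

	// main tile load: consecutive threads read consecutive bytes
	tile[threadIdx.y + 1][threadIdx.x + 1] = src[yc * srcPitch + xc];
	// left column, top row and corner of the apron
	if ( threadIdx.x == 0 )                     tile[threadIdx.y + 1][0] = src[yc * srcPitch + xm];
	if ( threadIdx.y == 0 )                     tile[0][threadIdx.x + 1] = src[ym * srcPitch + xc];
	if ( threadIdx.x == 0 && threadIdx.y == 0 ) tile[0][0]               = src[ym * srcPitch + xm];
	__syncthreads();

	int tx = threadIdx.x + 1;   // position of this pixel inside the tile
	int ty = threadIdx.y + 1;

	uchar4 c;
	c.w = 0;
	if ( (x & 1) == 0 )
	{
		if ( (y & 1) == 0 ) { c.x = tile[ty-1][tx-1]; c.y = tile[ty][tx-1]; c.z = tile[ty][tx];     }
		else                { c.x = tile[ty][tx-1];   c.y = tile[ty][tx];   c.z = tile[ty-1][tx];   }
	}
	else
	{
		if ( (y & 1) == 0 ) { c.x = tile[ty-1][tx];   c.y = tile[ty][tx];   c.z = tile[ty][tx-1];   }
		else                { c.x = tile[ty][tx];     c.y = tile[ty][tx-1]; c.z = tile[ty-1][tx-1]; }
	}

	// one 4-byte store per thread; consecutive threads write consecutive
	// uchar4 elements, so the writes coalesce
	if ( x < width && y < height )
		dst[y * width + x] = c;
}

In that scheme the raw frame is copied into a plain linear device buffer (srcPitch being its row pitch in bytes), the output buffer becomes width*height*sizeof(uchar4), and the texture and cudaArray are no longer needed.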

__global__ void debayer_TM( unsigned char *res, int width, int height )
{
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	// indexing is [y&1][x&1]
	// Not sure that I didn't screw up the indices here.
	// NOTE: ideally these arrays would reside in constant memory.
	const int xblue [2][2] = { {-1, 0}, {-1, 0} };
	const int xgreen[2][2] = { {-1, 0}, { 0,-1} };
	const int xred  [2][2] = { { 0,-1}, { 0,-1} };
	const int yblue [2][2] = { {-1,-1}, { 0, 0} };
	const int ygreen[2][2] = { { 0, 0}, { 0, 0} };
	const int yred  [2][2] = { { 0, 0}, {-1,-1} };

	res[y*width*3 + x*3]   = tex2D(imgTex, x+xblue [y&1][x&1], y+yblue [y&1][x&1]); //blue
	res[y*width*3 + x*3+1] = tex2D(imgTex, x+xgreen[y&1][x&1], y+ygreen[y&1][x&1]); //green
	res[y*width*3 + x*3+2] = tex2D(imgTex, x+xred  [y&1][x&1], y+yred  [y&1][x&1]); //red
}

This approach will get rid of the branch divergence. Note that memory writes remain uncoalesced.

Possibly move the “const int” arrays to constant memory to exploit the constant caches (they will then need to be initialized via cudaMemcpyToSymbol()!)
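A minimal sketch of what that could look like (the symbol names and the merged [channel] index are only illustrative):

// offset tables in constant memory, at file scope; indexing is
// [channel][y&1][x&1] with channel order B, G, R as in the output buffer
__constant__ int c_xoff[3][2][2];
__constant__ int c_yoff[3][2][2];

// host side, called once before the kernel launch
void initBayerOffsets()
{
	const int xoff[3][2][2] = { { {-1, 0}, {-1, 0} },     // blue
	                            { {-1, 0}, { 0,-1} },     // green
	                            { { 0,-1}, { 0,-1} } };   // red
	const int yoff[3][2][2] = { { {-1,-1}, { 0, 0} },     // blue
	                            { { 0, 0}, { 0, 0} },     // green
	                            { { 0, 0}, {-1,-1} } };   // red
	cudaMemcpyToSymbol(c_xoff, xoff, sizeof(xoff));
	cudaMemcpyToSymbol(c_yoff, yoff, sizeof(yoff));
}

// the kernel lookups then become, for channel ch in {0,1,2}:
//   res[y*width*3 + x*3 + ch] = tex2D(imgTex, x + c_xoff[ch][y&1][x&1],
//                                             y + c_yoff[ch][y&1][x&1]);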

Christian

But this is a modulo by a compile-time constant. Shouldn’t the compiler detect the modulo by a power of two and replace it with the much faster AND operation?

That is a good point. I don’t know the answer, but it would seem a pretty obvious and straightforward optimization. I should pull apart some PTX and have a look.

According to the Programming Guide, it does.
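(Worth noting that this straightforward replacement only applies when the operand is unsigned, as it is for the coordinates in the kernel above; a tiny illustration:)

// For unsigned x the two functions return the same value for every input,
// so the compiler is free to reduce the modulo to a single AND.
__device__ unsigned int oddEvenMod( unsigned int x ) { return x % 2; }
__device__ unsigned int oddEvenAnd( unsigned int x ) { return x & 1; }
// For a signed int this is not true ((-3) % 2 is -1 in C), so the code
// generated for a signed modulo is slightly more involved.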

Hi Christian,

thanks for your suggestion. You’re right! The number of divergent branches falls from 3590 down to 143. Unfortunately the GPU time (total) increased from 34% up to 75% (in msec: from 450 (avg.) up to 2700 (avg.)).

I picked up your idea and replaced the MOD operator with the AND operator (please see the enclosed code). The GPU time (total) decreased by only 1%. At the moment I haven’t got a clue how to speed up the algorithm any further, neither how to make use of shared memory nor how to make use of coalesced writes. May I ask you for some hints regarding my little project?

Thanks for your advice.

Sandra

__global__ void debayer_TM( unsigned char *res, int width, int height )
{
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	if ( x > 1 && y > 1 && x <= width && y <= height )
	{
		if ( (x&1)==0 )
		{
			if ( (y&1)==0 )
			{
				res[y*width*3 + x*3]   = tex2D(imgTex, x-1, y-1); //blue
				res[y*width*3 + x*3+1] = tex2D(imgTex, x-1, y);   //green
				res[y*width*3 + x*3+2] = tex2D(imgTex, x, y);     //red
			}
			else
			{
				res[y*width*3 + x*3]   = tex2D(imgTex, x-1, y); //blue
				res[y*width*3 + x*3+1] = tex2D(imgTex, x, y);   //green
				res[y*width*3 + x*3+2] = tex2D(imgTex, x, y-1); //red
			}
		}
		else
		{
			if ( (y&1)==0 )
			{
				res[y*width*3 + x*3]   = tex2D(imgTex, x, y-1); //blue
				res[y*width*3 + x*3+1] = tex2D(imgTex, x, y);   //green
				res[y*width*3 + x*3+2] = tex2D(imgTex, x-1, y); //red
			}
			else
			{
				res[y*width*3 + x*3]   = tex2D(imgTex, x, y);     //blue
				res[y*width*3 + x*3+1] = tex2D(imgTex, x-1, y);   //green
				res[y*width*3 + x*3+2] = tex2D(imgTex, x-1, y-1); //red
			}
		}
	}
}

It’s the uncoalesced char writes that are probably killing your performance; try writing uchar4s instead:

__global__ void debayer_TM( uchar4 *res, int width, int height )
{
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
	uchar4 c;

	if ( x > 1 && y > 1 && x <= width && y <= height )
	{
		if ( (x&1)==0 )
		{
			if ( (y&1)==0 )
			{
				c.x = tex2D(imgTex, x-1, y-1); //blue
				c.y = tex2D(imgTex, x-1, y);   //green
				c.z = tex2D(imgTex, x, y);     //red
			}
			else
			{
				c.x = tex2D(imgTex, x-1, y); //blue
				c.y = tex2D(imgTex, x, y);   //green
				c.z = tex2D(imgTex, x, y-1); //red
			}
		}
		else
		{
			if ( (y&1)==0 )
			{
				c.x = tex2D(imgTex, x, y-1); //blue
				c.y = tex2D(imgTex, x, y);   //green
				c.z = tex2D(imgTex, x-1, y); //red
			}
			else
			{
				c.x = tex2D(imgTex, x, y);     //blue
				c.y = tex2D(imgTex, x-1, y);   //green
				c.z = tex2D(imgTex, x-1, y-1); //red
			}
		}
	}

	// write result
	res[y*width + x] = c;
}
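(For completeness, the host wrapper has to change as well, since each output pixel is now 4 bytes. Here is a sketch along the lines of the earlier host code; the function name is made up, and it assumes iplOut was created as a 4-channel IPL_DEPTH_8U image and reuses the imgTex texture from above:)

extern "C" void CudaDeBayerTM4( IplImage *iplIn, IplImage *iplOut )
{
	uchar4 *DEVres;
	cudaArray *imgArray;
	cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<unsigned char>();

	cudaMallocArray(&imgArray, &channelDesc, iplIn->width, iplIn->height);
	// 4 bytes per output pixel instead of 3
	cudaMalloc((void**)&DEVres, iplIn->width * iplIn->height * sizeof(uchar4));

	cudaMemcpy2DToArray(imgArray, 0, 0, (unsigned char*) iplIn->imageData,
	                    iplIn->widthStep, iplIn->width, iplIn->height,
	                    cudaMemcpyHostToDevice);
	cudaBindTextureToArray(imgTex, imgArray, channelDesc);

	dim3 block(16, 8, 1);
	dim3 grid(iplIn->width/block.x, iplIn->height/block.y, 1);
	debayer_TM <<< grid, block >>> ( DEVres, iplIn->width, iplIn->height );

	// assumes iplOut->widthStep == width*4, which holds for 4-channel 8-bit images
	cudaMemcpy(iplOut->imageData, DEVres,
	           iplIn->width * iplIn->height * sizeof(uchar4),
	           cudaMemcpyDeviceToHost);

	cudaUnbindTexture(imgTex);
	cudaFreeArray(imgArray);
	cudaFree(DEVres);
}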

What you’re seeing is that the const int arrays end up in (uncached) local memory, causing a lot of extra reads, so any speed gain from less branching is eaten up by the extra memory loads. One would have to declare these arrays __constant__ int and move them to file scope, I guess, and then initialize their contents with cudaMemcpyToSymbol(). Only then would I expect my proposed code to be faster than your original version (and also than Simon Green’s uchar4 version, which still branches a lot).

Hello Simon,

that’s great! Now I can see what you mean by using coalesced character writes. I understand the basic idea. Thank you very much for your help!!!

Since my IplImage doesn’t include an alpha channel, I’m now using the type ‘uchar3’ for the three colors. The last component (here: c.w) would have messed up my resulting image.

The runtime of the kernel has now decreased from 34% down to 25%. For one image of 1000 x 1000 pixels the calculation now takes between 260 and 340 msec. That brings me to my next question: when I profiled the kernel runtime I noticed considerable fluctuations. What are the reasons for runtime differences of about 80 msec (340-260=80) in GPU time? Please take a look at the enclosed screenshot “kerneltimes.png”. I also haven’t found an answer to the high CPU times yet. Why does a calculation of 270 msec of GPU time take over 1662 msec of CPU time (e.g. line 27)? Also please take a look at line 28: here the GPU time is 304 msec and the CPU time is a very long 4984 msec. Is this a consequence of the slow latency of the memory chips?

Maybe that’s also the reason for the long GPU idle time between the upload to the device and the kernel execution (please take a look at the enclosed screenshot “idletime.png”). How can I minimize this gap? Should (or better: could) I make use of streams or asynchronous memcpys? Again I can’t see the forest for the trees. May I please ask you for some answers?
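(For reference, overlapping the upload with other work generally requires page-locked host memory plus the asynchronous copy functions; a rough sketch of the pattern, reusing imgArray, DEVres, grid and block from the host code earlier in the thread, not a drop-in replacement:)

// page-locked staging buffer; ordinary malloc'd/IplImage memory cannot be
// copied asynchronously, so the frame is staged here first
unsigned char *pinnedIn;
cudaMallocHost((void**)&pinnedIn, iplIn->width * iplIn->height);

cudaStream_t stream;
cudaStreamCreate(&stream);

// pack the frame row by row into the pinned buffer (widthStep may be padded)
for (int row = 0; row < iplIn->height; ++row)
	memcpy(pinnedIn + row * iplIn->width,
	       iplIn->imageData + row * iplIn->widthStep,
	       iplIn->width);

// asynchronous upload and kernel launch in the same stream: they still run
// in order relative to each other, but control returns to the CPU at once,
// so the next frame can be prepared in the meantime
cudaMemcpy2DToArrayAsync(imgArray, 0, 0, pinnedIn,
                         iplIn->width, iplIn->width, iplIn->height,
                         cudaMemcpyHostToDevice, stream);
debayer_TM <<< grid, block, 0, stream >>> ( DEVres, iplIn->width, iplIn->height );

cudaStreamSynchronize(stream);   // block only when the result is really needed

This only helps if there is CPU-side work (or another frame) to overlap with; it does not make a single copy itself any faster.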

Thanks for your help.

Sandra

idletime.png
kerneltimes.png