Malvar-He-Cutler Demosaicing Code and optimization advice?

Hi,

so I implemented demosaicing with the 5x5 kernels from the paper by Malvar, He and Cutler. First of all, here is the kernel code in case anyone needs it.

GetPixel and SetColorPixel are device functions to read the raw source data and write to the output image. They are currently specific to my images of a fixed size and therefore not posted here.

__global__ void gpu_MalvarHeCutler(unsigned short *d_raw, unsigned char *out, size_t rgbPitch, size_t pitch)
{
	// 16x16 pixel tile plus a 2-pixel apron on each side
	__shared__ unsigned short subImage[20][20];

	int x = 16*blockIdx.x + threadIdx.x;
	int y = 16*blockIdx.y + threadIdx.y;
	int bx = threadIdx.x;
	int by = threadIdx.y;
	int col = threadIdx.z;
	unsigned short val;

	if (col == 0)			// Get inner image
	{
		subImage[bx+2][by+2] = GetPixel(d_raw, x, y, pitch);
	}
	else if (col == 1)		// Get sides
	{
		if (bx < 2)  subImage[bx][by+2]   = GetPixel(d_raw, x-2, y, pitch);	// left side
		if (bx > 13) subImage[bx+4][by+2] = GetPixel(d_raw, x+2, y, pitch);	// right side
		if (by < 2)  subImage[bx+2][by]   = GetPixel(d_raw, x, y-2, pitch);	// top side
		if (by > 13) subImage[bx+2][by+4] = GetPixel(d_raw, x, y+2, pitch);	// bottom side
	}
	else if (col == 2)		// Get corners (only the (+-1,+-1) diagonals are ever read)
	{
		if (bx == 0  && by == 0)  subImage[1][1]   = GetPixel(d_raw, x-1, y-1, pitch);
		if (bx == 15 && by == 15) subImage[18][18] = GetPixel(d_raw, x+1, y+1, pitch);
		if (bx == 0  && by == 15) subImage[1][18]  = GetPixel(d_raw, x-1, y+1, pitch);
		if (bx == 15 && by == 0)  subImage[18][1]  = GetPixel(d_raw, x+1, y-1, pitch);
	}
	__syncthreads();

	switch (col)
	{
	case _RED_:
		if (x % 2 == 0 && y % 2 == 0)		// red sample site: copy through
		{
			val = subImage[bx+2][by+2];
			SetColorPixel(out, x, y, _RED_, val, rgbPitch);
		}
		else if (x % 2 == 0 && y % 2 != 0)	// green site in a red column
		{
			val = 5*subImage[bx+2][by+2] + 4*subImage[bx+2][by+3] + 4*subImage[bx+2][by+1]
					- subImage[bx+2][by+4] - subImage[bx+2][by]
					- subImage[bx+3][by+3] - subImage[bx+3][by+1] - subImage[bx+1][by+3] - subImage[bx+1][by+1]
					+ 0.5*subImage[bx+4][by+2] + 0.5*subImage[bx][by+2];
			SetColorPixel(out, x, y, _RED_, 0.125*val, rgbPitch);
		}
		else if (x % 2 != 0 && y % 2 == 0)	// green site in a red row
		{
			val = 5*subImage[bx+2][by+2] + 4*subImage[bx+3][by+2] + 4*subImage[bx+1][by+2]
					- subImage[bx+3][by+3] - subImage[bx+3][by+1] - subImage[bx+1][by+3] - subImage[bx+1][by+1]
					- subImage[bx+4][by+2] - subImage[bx][by+2]
					+ 0.5*subImage[bx+2][by+4] + 0.5*subImage[bx+2][by];
			SetColorPixel(out, x, y, _RED_, 0.125*val, rgbPitch);
		}
		else					// blue site: -1.5 taps sit on the axial second neighbours
		{
			val = 6*subImage[bx+2][by+2]
					+ 2*subImage[bx+3][by+3] + 2*subImage[bx+3][by+1] + 2*subImage[bx+1][by+3] + 2*subImage[bx+1][by+1]
					- 1.5*subImage[bx+4][by+2] - 1.5*subImage[bx][by+2] - 1.5*subImage[bx+2][by+4] - 1.5*subImage[bx+2][by];
			SetColorPixel(out, x, y, _RED_, 0.125*val, rgbPitch);
		}
		break;
	case _GREEN_:
		if (x % 2 == 0 && y % 2 == 0)		// red sample site
		{
			val = 4*subImage[bx+2][by+2]
					+ 2*subImage[bx+3][by+2] + 2*subImage[bx+1][by+2] + 2*subImage[bx+2][by+3] + 2*subImage[bx+2][by+1]
					- subImage[bx+4][by+2] - subImage[bx][by+2] - subImage[bx+2][by+4] - subImage[bx+2][by];
			SetColorPixel(out, x, y, _GREEN_, 0.125*val, rgbPitch);
		}
		else if (x % 2 != 0 && y % 2 != 0)	// blue sample site
		{
			val = 4*subImage[bx+2][by+2]
					+ 2*subImage[bx+3][by+2] + 2*subImage[bx+1][by+2] + 2*subImage[bx+2][by+3] + 2*subImage[bx+2][by+1]
					- subImage[bx+4][by+2] - subImage[bx][by+2] - subImage[bx+2][by+4] - subImage[bx+2][by];
			SetColorPixel(out, x, y, _GREEN_, 0.125*val, rgbPitch);
		}
		else					// green sample site: copy through
		{
			val = subImage[bx+2][by+2];
			SetColorPixel(out, x, y, _GREEN_, val, rgbPitch);
		}
		break;
	case _BLUE_:
		if (x % 2 == 0 && y % 2 == 0)		// red site: -1.5 taps sit on the axial second neighbours
		{
			val = 6*subImage[bx+2][by+2]
					+ 2*subImage[bx+3][by+3] + 2*subImage[bx+3][by+1] + 2*subImage[bx+1][by+3] + 2*subImage[bx+1][by+1]
					- 1.5*subImage[bx+4][by+2] - 1.5*subImage[bx][by+2] - 1.5*subImage[bx+2][by+4] - 1.5*subImage[bx+2][by];
			SetColorPixel(out, x, y, _BLUE_, 0.125*val, rgbPitch);
		}
		else if (x % 2 != 0 && y % 2 == 0)	// green site in a blue column
		{
			val = 5*subImage[bx+2][by+2] + 4*subImage[bx+2][by+3] + 4*subImage[bx+2][by+1]
					- subImage[bx+2][by+4] - subImage[bx+2][by]
					- subImage[bx+3][by+3] - subImage[bx+3][by+1] - subImage[bx+1][by+3] - subImage[bx+1][by+1]
					+ 0.5*subImage[bx+4][by+2] + 0.5*subImage[bx][by+2];
			SetColorPixel(out, x, y, _BLUE_, 0.125*val, rgbPitch);
		}
		else if (x % 2 == 0 && y % 2 != 0)	// green site in a blue row
		{
			val = 5*subImage[bx+2][by+2] + 4*subImage[bx+3][by+2] + 4*subImage[bx+1][by+2]
					- subImage[bx+3][by+3] - subImage[bx+3][by+1] - subImage[bx+1][by+3] - subImage[bx+1][by+1]
					- subImage[bx+4][by+2] - subImage[bx][by+2]
					+ 0.5*subImage[bx+2][by+4] + 0.5*subImage[bx+2][by];
			SetColorPixel(out, x, y, _BLUE_, 0.125*val, rgbPitch);
		}
		else					// blue sample site: copy through
		{
			val = subImage[bx+2][by+2];
			SetColorPixel(out, x, y, _BLUE_, val, rgbPitch);
		}
		break;
	}
}

So the approach is to have 16x16x3 threads in a block, working on a 16x16 pixel area and the 3 colors. Currently my image size is always a multiple of 16; for other sizes some adaptations would be necessary.

First, a tile of 20x20 pixels is loaded into shared memory. With this, the 5x5 kernels can be computed for all pixels and written to the output. Input and output memory is always allocated and used with cudaMallocPitch and cudaMemcpy2D.

The way I assign threads to fetch certain parts of the 20x20 subimage might be ugly, but I couldn’t figure out a better method.

Maybe someone has ideas how to improve this code, as it’s my first major project with CUDA. Do I have issues with warp divergence here, and what can be done about it? Are the modulo operations bad / slow? Could I make better use of shared memory?

Thanks for any input, and maybe someone might even find my code useful.

Edit: Oh, and _RED_, _GREEN_ and _BLUE_ are simply defined as 0, 1 and 2 to match the threadIdx.z values.

Check out the Image Convolution with CUDA whitepaper from the SDK.

Thanks, but how exactly does that help me?

The way the apron is loaded seems OK (avoids idle threads in the computation) and my kernels are not separable.

OK, the whitepaper isn’t of too much use then.

The most important question actually is about GetPixel(), because that determines whether your memory access is efficient or not. (I was hoping that pointing you to the SDK example would make you think about coalesced memory access.)

Next would be to get rid of all the conditionals in the compute part. As different code paths within one warp get executed sequentially, they make half of your CUDA cores sit idle. Worse, they break coalescing. So make sure that you write back the result in exactly one place, and replace the conditionals with precomputed coefficients.

Finally, you can get rid of blockIdx.x and instead loop over it inside your kernel. This allows you to reuse data from the previous loop iteration to avoid loading the left border, or, with a little reorganization, to avoid loading the left and right borders altogether. It also allows you to reuse the precomputed coefficients, although that's probably a minor improvement.

Which compute capability shall this code be optimized for?

OK, here is the code for GetPixel():

__device__ unsigned short GetPixel(unsigned short *image, int x, int y, size_t pitch)	// seems to work
{
	// Mirror coordinates if out of bounds
	if (x < 0) x = -x;
	if (x > PIXEL_X-1) x = 2*(PIXEL_X-1) - x;
	if (y < 0) y = -y;
	if (y > PIXEL_Y-1) y = 2*(PIXEL_Y-1) - y;

	return *(unsigned short *)((char *)image + y*pitch + x*sizeof(unsigned short));
}

As you can see, it uses the constant size of the image (PIXEL_X and PIXEL_Y) to detect whether a read is outside the area and mirrors the coordinate, which is of course specific to my problem. The read operation itself seems OK to me, right?

To make sure I understand, getting rid of the conditionals would mean replacing

if (x < 0) x = -x;

by something like

int coeffs[2] = {1, -1};
int n = (x < 0);
x = x * coeffs[n];

Is that the way to do it? It seems quite difficult for the many conditionals in the kernel, but probably possible.

But I don't get what you mean by replacing blockIdx.x with a loop. How is that supposed to look? I just use blockIdx.x to locate the subimage in the whole dataset.

The code is used on a GTX 465, so that would make it compute capability 2.0 (I think).

You could try replacing your GetPixel() with a texture lookup i.e. tex2D(). This would avoid coalescing issues. Unfortunately I don’t think we support mirror texture wrap mode in CUDA.
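Something along these lines is what I mean (just a sketch with the texture reference API; texRaw is a made-up name, and clamp stands in for your mirroring, so results will differ slightly at the image borders):

texture<unsigned short, 2, cudaReadModeElementType> texRaw;

// Host side, once after cudaMallocPitch():
cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned short>();
texRaw.addressMode[0] = cudaAddressModeClamp;	// no mirror mode, so clamp at the borders
texRaw.addressMode[1] = cudaAddressModeClamp;
texRaw.filterMode = cudaFilterModePoint;	// return the raw texel, no filtering
cudaBindTexture2D(NULL, texRaw, d_raw, desc, PIXEL_X, PIXEL_Y, pitch);

// Device side, replacing GetPixel(d_raw, x, y, pitch):
unsigned short v = tex2D(texRaw, x + 0.5f, y + 0.5f);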

You might want to replace “x % 2 == 0” with “(x & 1) == 0”, integer modulo is relatively slow (although the compiler might be doing this anyway).

Yes, GTX 465 is compute capability 2.0. So we don’t have to worry about the stricter 1.0 or 1.1 coalescing rules then, and GetPixel() and the memory reads are fine.

That leaves the writes next. The kernel will probably be bandwidth bound, so fixing the writeout part will likely be enough, and eliminating the ifs may not be necessary.
Moving SetColorPixel(out, x, y, color, val, rgbPitch); out of the conditionals means the kernel won't have to rely on the write-combining cache to minimize overall memory writes. Same goes for the colors. Unfortunately I haven't asked about the output format or the code of SetColorPixel(). If the output format has the colors interleaved, it might be worth computing the red, green, and blue subpixels in the same thread and then writing them all out at once using an appropriate uint3 or uint4 type.
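As a sketch of that last idea (hypothetical, since it assumes an interleaved RGBA8 output surface with 4 bytes per pixel; SetRGBAPixel and rgbaPitch are made-up names):

__device__ void SetRGBAPixel(unsigned char *target, int x, int y,
                             unsigned short r, unsigned short g, unsigned short b,
                             size_t rgbaPitch)
{
	// one coalesced 32-bit store instead of three scattered byte stores
	uchar4 px = make_uchar4(r >> 4, g >> 4, b >> 4, 0);	// 12 bit -> 8 bit per channel
	*(uchar4 *)(target + y * rgbaPitch + 4 * x) = px;
}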

Next are the computations. This is maybe not worth it if the kernel's throughput is bandwidth limited. You can unify all the different val=… cases into one that reads val = c0 * subImage[bx+2][by+2] + c1 * subImage[bx+dx1][by+dy1] + c2 * subImage[bx+dx2][by+dy2] + …, with cn, dxn, and dyn suitably set up at the beginning of the kernel.
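To make that concrete, a sketch for the green plane only, as a drop-in for the _GREEN_ case of your kernel (it reuses your subImage, bx, by, x, y; red and blue would get their own coefficient setup). The ternaries compile to predicated selects rather than divergent branches, and there is exactly one write per thread:

// Green samples sit where x+y is odd in your kernel's Bayer layout.
bool atGreen = ((x ^ y) & 1) != 0;
float c0 = atGreen ? 8.0f :  4.0f;	// center tap
float c1 = atGreen ? 0.0f :  2.0f;	// four nearest neighbours
float c2 = atGreen ? 0.0f : -1.0f;	// four axial second neighbours
float g = c0 *  subImage[bx+2][by+2]
        + c1 * (subImage[bx+3][by+2] + subImage[bx+1][by+2]
              + subImage[bx+2][by+3] + subImage[bx+2][by+1])
        + c2 * (subImage[bx+4][by+2] + subImage[bx][by+2]
              + subImage[bx+2][by+4] + subImage[bx+2][by]);
SetColorPixel(out, x, y, _GREEN_, 0.125f * g, rgbPitch);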

Finally, the loop I mentioned: Instead of running blocks with different blockIdx.x in parallel, just set dimGrid.x = 1. Then place the whole kernel (apart from the setup of the constants mentioned in the last paragraph) in a big loop for (MyBlockIdxX = 0; MyBlockIdxX < …; MyBlockIdxX++) {…}. Now, in each but the first iteration, you do not need to reload the left border from memory, as it is already in shared memory as part of the previous iteration's center block.
If you replace loading of the right border by loading the complete right block, you can also reuse that as the center block in the next iteration, which gets rid of the extra memory accesses for the left and right borders entirely.
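To illustrate the pattern outside the demosaicing context, here is a toy 1D example (a 3-tap average; the width is assumed to be a multiple of BLOCK, and the kernel is launched with gridDim.x = 1). The right edge of each tile is carried over as the left halo of the next, so only the right halo is loaded per iteration:

#define BLOCK 256

__global__ void walkingAverage(const float *in, float *out, int width)
{
	__shared__ float tile[BLOCK + 2];	// centre plus a 1-element halo on each side
	int tx = threadIdx.x;

	for (int blk = 0; blk < width / BLOCK; ++blk)	// replaces blockIdx.x
	{
		int x = blk * BLOCK + tx;
		if (blk == 0 && tx == 0)
			tile[0] = in[0];			// first pass: left halo (clamped)
		tile[tx + 1] = in[x];				// centre element
		if (tx == BLOCK - 1)				// right halo (clamped)
			tile[BLOCK + 1] = (x + 1 < width) ? in[x + 1] : in[x];
		__syncthreads();

		out[x] = (tile[tx] + tile[tx + 1] + tile[tx + 2]) / 3.0f;
		__syncthreads();		// everyone is done reading the tile

		if (tx == 0)
			tile[0] = tile[BLOCK];	// old right edge becomes the new left halo
		__syncthreads();
	}
}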

Most of the memory optimizations will, however, be more effective on compute 1.x parts. On 2.x, much of this is already taken care of by the cache, and the effect will be a lot smaller.

To answer one question from your first post: The modulo operation x%2 can be replaced by x&1, however the compiler should already take care of that by itself. Modulo operations that cannot be optimized that way are rather expensive.

All in all I would say that your code should do quite well on a 2.x device.

For completeness, here is the write function:

__device__ void SetColorPixel(unsigned char *target, int x, int y, int color, unsigned short value, size_t rgb_pitch)
{
	// Planar output: one full image plane per color channel
	unsigned char *colorBase = target + color * rgb_pitch * PIXEL_Y;
	*(colorBase + y*rgb_pitch + x) = value >> 4;	// 12 bit -> 8 bit
}

It writes to a planar RGB format and converts to 8 bit by just shifting the original 12-bit values.

Thanks a lot for the explanations, I will try to implement some of the suggestions and see how much they improve things. However, hearing that the code should already do quite well on a 2.x device was the most important thing for me, as I was afraid I might have missed something serious and that there was an obvious way to improve dramatically.

When I have time I might polish the code, make it general (for all image sizes) and then post it again here, as I believe a lot of people doing image processing could use a fast demosaicing method.

Edit: Oh, and if I were to use precomputed coefficients as suggested, where do I best store them? Constant memory? Just initialize them as kernel variables? Or doesn't it matter?

Some things I would try for performance optimization:

  • Use float for shared memory and all calculations, and only convert back to unsigned short just before storing (there is a lot of hidden short->float conversion happening, for example in 0.5*subImage[bx+4][by+2]); see the sketch after this list
  • Use the coefficients instead of ifs as suggested, stored in shared memory (read from global memory once during block init)
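A sketch of the first point, shown for one green output pixel inside the posted kernel (clamping of negative intermediate values is omitted): convert once on load, compute in float, convert once on store:

__shared__ float subImage[20][20];	// tile kept in float

// load: one unsigned short -> float conversion per pixel
subImage[bx+2][by+2] = (float)GetPixel(d_raw, x, y, pitch);

// compute entirely in float, e.g. green at a red/blue sample site:
float val = 0.125f * (4.0f * subImage[bx+2][by+2]
		+ 2.0f * (subImage[bx+3][by+2] + subImage[bx+1][by+2]
			+ subImage[bx+2][by+3] + subImage[bx+2][by+1])
		-        (subImage[bx+4][by+2] + subImage[bx][by+2]
			+ subImage[bx+2][by+4] + subImage[bx+2][by]));

// store: one float -> integer conversion before the 12->8 bit shift
SetColorPixel(out, x, y, _GREEN_, (unsigned short)__float2uint_rn(val), rgbPitch);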

Good spotting by Nighthawk13. There is no need for floating point arithmetic in your kernel at all. Although floating point operations by themselves are not slower than integer operations, the conversions cost time.

Probably won’t make a difference though as the kernel is probably memory bandwidth bound.

With the SetColorPixel() you posted, everything would depend on whether rgb_pitch is 3 or 4. With 4, things get quite easy, while 3 would mix words from different threads. For optimizing the writeout you would then need to go through shared memory first, so that threads can write full words to global memory.
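A hypothetical sketch of that staging step for the 3-bytes-per-pixel case (stage and rowPitch are made-up names; it assumes the 16x16x3 block layout of the original kernel, the single writeback point per thread discussed above, and a 4-byte-aligned interleaved output row):

// Each thread deposits its 8-bit channel into shared memory; then a few
// threads per row write the 48 packed bytes out as twelve 32-bit words.
__shared__ unsigned char stage[16][48];	// 16 rows x (16 pixels * 3 bytes)

stage[by][3*bx + col] = val >> 4;	// final 12-bit result of this thread's channel
__syncthreads();

if (col == 0 && bx < 12)		// twelve uints cover one 48-byte row
{
	unsigned int *dstRow = (unsigned int *)(out + y * rowPitch + 48 * blockIdx.x);
	dstRow[bx] = ((unsigned int *)stage[by])[bx];
}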

For the precomputed coefficients just use registers (kernel variables), as these are the fastest way to have different values per thread. Only if that uses too many registers (so that occupancy would be negatively affected) store them in shared memory. There is no need to read them from global memory though; recomputing them is faster.

Constant memory is not a good option as it requires multiple cycles if not all threads of a warp access the same memory location.

One thing I missed previously is that once you get rid of the ifs, your code will generate bank conflicts in shared memory. Declare subImage as unsigned int instead of unsigned short to avoid that.

This brings up the idea that you also could, instead of using precomputed coefficients, reorganize the way you assign threads to pixels in order to make sure that threads within a warp always take the same code path. However that will require a lot of brain twisting for a solution that is not going to be any faster than the idea with precomputed coefficients. It’s just another option.

I would see it the other way around: there is no need for integer arithmetic at all!

Floating point operations are equally fast or faster on Fermi, as some integer operations can only execute on half the CUDA cores.

If the GetPixel() function uses texture access, the initial unsigned short -> float conversion (plus the address calculation with the pitch) is done by the texture unit.

If you want to stay with integer ops, I would replace "0.125*val" with (val>>3), for example.

val should then be an unsigned int, not an unsigned short; otherwise there will be hidden conversions between unsigned short and unsigned int.

Check out this paper:

Efficient, High-Quality Bayer Demosaic Filtering on GPUs

It's implemented in GLSL, but it is easy to convert to CUDA (and very fast).