Code runs with LLVM CPU but not with GPU (iMac, Leopard)

Hello,

I wrote a small medianfilter kernel with variable filtersize as an exercise for OpenCV and noticed some strange behaviour.
The driver used is the one that is delivered with Leopard (iMac, nVidia).

Here is the code:


#define MaxMedianRange 7 // → max num of pixels == (2*7+1)^2 = 225

__kernel void Median( __global const uchar4 * Source, __global uchar4 * Dest, __global uchar4 * Err,
const int ImageWidth, const int ImageHeight, const int MedianRange
)
{
// get position to be filtered
int PosX = get_global_id(0);
int PosY = get_global_id(1);
int Offset = PosY*ImageWidth+PosX;

        // check filtersize limitation
if ( MedianRange > MaxMedianRange )
{
	Dest[Offset] = Source[Offset];
	return;
}

        // this buffer should be big enough for sorting the max. 225 values,
        // 256 is chosen for alignment (I don't know if this makes a difference...)
uchar Buffer[256];

        // filter every channel seperately
for ( int Channel = 0; Channel<3; Channel++ )
{
	uint o=0; // running offset

                    // copy surrounding pixels into a buffer
	for ( int y=PosY-MedianRange; y<=PosY+MedianRange; y++ )
	for ( int x=PosX-MedianRange; x<=PosX+MedianRange; x++, o++ )
	{
		uint dx,dy; // "unsigned" should be no problem here!
	
                                // check image boundaries
		if ( x<0 ) dx = 0;
		else if ( x>= ImageWidth  ) dx=ImageWidth -1;
		else ( dx = x );

		if ( y<0 ) dy = 0;
		else if ( y>= ImageHeight ) dy=ImageHeight-1;
		else ( dy = y );

		Buffer[o] = Source[dy*ImageWidth+dx][Channel];
	}

	// bubblesort the buffer, offset "o" should have resulted in the number of pixels in it.
	bsort( Buffer, o );

	uchar median, original, error;

                    // select middle pixel in the buffer as median result
	median = Buffer[MedianRange];
	Dest[Offset][Channel] = median;

                    // calculate some extra error value for this pixel
	original = Source[Offset][Channel];
	error = 127+((int)median-(int)original)/2;
	Err[Offset][Channel]  = error;
}

        // set alpha to max. value
Dest[Offset][3] = 255;
Err[Offset][3]  = 255;

}

The “bsort(…)” is a trivial inplace bubblesort function and shouldn’t be responsible for the strange behaviour (hopefully!)

The result is correct if CPU is selected as the device.
But if GPU is selected the folowing problems occur:

  • The filtering looks “ok”, but the blue channel seems to be set to max. (255), so the image is very blueish. Alignment problem?
  • The driver crashes if the median range is set to 4…7, only 1…3 run without crash, but result in the blueish mentioned above…

So the problem seems to be the compiler.
Are there suggestions for a workaround in the kernelcode or is there a compiler argument which may solve the problem?
Is there a new driver on its’ way or already available??

Thanks in advance!

Konstantin Dols

Ps: Oops, for some reason the spacing in the code doesn’t work and makes it harder to read, sorry for this!

I see you are using array subscriptions for uchar4, which doesn’t seem to be right. I’d recommend using Dest[Offset].w (as in xyzw) instead of Dest[Offset][3].

This is just my guess though.

In your main loop you probably has to manually unroll that if subscriptions can’t be used.

It’s possible that your kernel simply runs too long for larger median range. How long does it take to run for range = 3? If it already takes several seconds, then it’s probably the cause.

Note that, as in Windows, Snow Leopard will reset the GPU if it takes too long to run a kernel.

Also your code may need some work to make it run faster. Remember that current GPU do not have cache, so repeatedly read the same memory location can be very slow. Also it’s better to make read aligned (coalesced). You may want to try reading into a local memory before further processing.

I think you can use a “code” tag, such as

for(int i = 1; i <= 100; i++) {

	sum += i;

}

Hi there,

some loop-unrolling and using xyzw instead of 0123 solved the problem, thank you very much!

Now I can even filter up to the given bufferlimit without crash, even if it takes several seconds.

But there is still a little problem, it looks like some pixels are not filtered correclty,

and they appear in little rows and columns, like there were whole workgroups that are not processed.

Suggestions? :-D

Here is the (nearly) perfect code:

void bsort( uchar* _a, int _n )

{

	int n = _n;	

	bool swapped=true;

	while ( swapped || n>0 )

	{

		swapped = false;

		for ( int i=0; i<n-1; i++ )

		if ( _a[i] > _a[i+1] )

		{

			char tmp = _a[i];

			_a[i] = _a[i+1];

			_a[i+1] = tmp;

			swapped = true;

		}

		n--;

	}

}

#define MaxMedianRange 7

#define MaxBufferSize ((2*MaxMedianRange+1)*(2*MaxMedianRange+1))

	

__kernel void Median(__global const uchar4 * Source, __global uchar4 * Dest, __global uchar4 * Err,

					 const int ImageWidth, const int ImageHeight, const int MedianRange

					)

{

	int PosX	= get_global_id(0);

	int PosY	= get_global_id(1);

	int Offset  = PosY*ImageWidth+PosX;

	uchar4 Buffer[MaxBufferSize];

	uchar  rBuffer[MaxBufferSize];

	uchar  gBuffer[MaxBufferSize];

	uchar  bBuffer[MaxBufferSize];

	uint o=0;

	for ( int y=PosY-MedianRange; y<=PosY+MedianRange; y++ )

	for ( int x=PosX-MedianRange; x<=PosX+MedianRange; x++ )

	{

		int dx,dy;

	

		// clamp block to keep it inside the image

		if ( x < 0 )

			dx = 0;

		else if ( x > ImageWidth -1 )

			dx = ImageWidth -1;

		else

			dx = x;

		

		if ( y < 0 )

			dy = 0;

		else if ( y > ImageHeight-1 )

			dy = ImageHeight-1;

		else

			dy = y;

		Buffer[o++] = Source[dy*ImageWidth+dx];

	}

	

	// split buffers into seperated channels

	for ( int i=0; i<o; i++ )

	{

		rBuffer[i] = Buffer[i].x;

		gBuffer[i] = Buffer[i].y;

		bBuffer[i] = Buffer[i].z;

	}

	// bsort buffers

	bsort( rBuffer, o );

	bsort( gBuffer, o );

	bsort( bBuffer, o );

	// get median values and errors

	uchar4 original, median, error;

	int Center = 2*MedianRange*(MedianRange+1);

	original.x = Source[Offset].x;

	median.x = rBuffer;

	error.x = 127+((int)median.x-(int)original.x)/2;

	original.y = Source[Offset].y;

	median.y = gBuffer;

	error.y = 127+((int)median.y-(int)original.y)/2;

	original.z = Source[Offset].z;

	median.z = bBuffer;

	error.z = 127+((int)median.z-(int)original.z)/2;

	median.w   = 255;

	error.w	= 255;

	

	Dest[Offset] = median;

	Err [Offset] = error;

}

Greetz, Konstantin