Weird error/bug ?!?

Hi,

I wrote a basic edge-detector kernel; it looks like this:

void __global__ kernelEdgeb3(uchar4* out, int pitch)
{
	int y = blockDim.y * blockIdx.y + threadIdx.y;
	int x = blockDim.x * blockIdx.x + threadIdx.x;
	int sumx, sumy;
	uchar4 result;

	// Fetch the 3x3 neighborhood (the center pixel rgb5 is not needed by Sobel)
	uchar4 rgb1 = texfetch(texb4_a, x-1, y-1);
	uchar4 rgb2 = texfetch(texb4_a, x,   y-1);
	uchar4 rgb3 = texfetch(texb4_a, x+1, y-1);
	uchar4 rgb4 = texfetch(texb4_a, x-1, y);
	uchar4 rgb6 = texfetch(texb4_a, x+1, y);
	uchar4 rgb7 = texfetch(texb4_a, x-1, y+1);
	uchar4 rgb8 = texfetch(texb4_a, x,   y+1);
	uchar4 rgb9 = texfetch(texb4_a, x+1, y+1);

	// Horizontal (sumX) and vertical (sumY) Sobel responses per channel
	int3 sumX;
	int3 sumY;
	sumX.x = (int)rgb1.x - rgb3.x + ((int)rgb4.x - rgb6.x)*2 + rgb7.x - rgb9.x;
	sumX.y = (int)rgb1.y - rgb3.y + ((int)rgb4.y - rgb6.y)*2 + rgb7.y - rgb9.y;
	sumX.z = (int)rgb1.z - rgb3.z + ((int)rgb4.z - rgb6.z)*2 + rgb7.z - rgb9.z;
	sumY.x = (int)rgb1.x - rgb7.x + ((int)rgb2.x - rgb8.x)*2 + rgb3.x - rgb9.x;
	sumY.y = (int)rgb1.y - rgb7.y + ((int)rgb2.y - rgb8.y)*2 + rgb3.y - rgb9.y;
	sumY.z = (int)rgb1.z - rgb7.z + ((int)rgb2.z - rgb8.z)*2 + rgb3.z - rgb9.z;

	// Average the per-channel magnitudes and clamp to 255
	sumx = min((abs(sumX.x) + abs(sumX.y) + abs(sumX.z)) / 3, 255);
	sumy = abs(sumY.x) + abs(sumY.y);
/*---->*/	//sumy += abs(sumY.z);
	sumy = min(sumy / 3, 255);
	result = make_uchar4(sumx, 0, sumy, 255);

	//return result

	// Step to byte row pitch*y of the pitched output buffer, then write column x
	out = (uchar4*)( ((uchar1*)out) + pitch*y );
	out[x] = result;
}

The input texture is of type texture<uchar4, 2, cudaReadModeElementType> and holds RGBA information. I’m just applying the discrete Sobel filter kernel to each of the color channels and combining the results afterwards.
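For reference, these are the standard discrete Sobel kernels that the sums above implement per channel (the signs match the code):

	Gx = [ +1  0  -1 ]        Gy = [ +1  +2  +1 ]
	     [ +2  0  -2 ]             [  0   0   0 ]
	     [ +1  0  -1 ]             [ -1  -2  -1 ]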

You can skim most of it and jump straight to the line I marked with an arrow. Uncommenting that line makes my resulting image grey. That should be impossible, since make_uchar4(…) always sets the green channel to zero.

The problem seems to be accessing sumY.z! Any other variable or constant works.

Further testing showed that rgb2.z - rgb8.z causes the failure (grey result image); outputting rgb2.z or rgb8.z alone works.

Also, the whole calculation works if I comment out the computation of any one of the components of sumX. So could this be some kind of stack overflow?

What might cause this problem?

I’m working with MSVS 2005 on WinXP with an 8800 GTX.

Thanks in advance!

Can you post a complete example so that we can try to reproduce the bug?

Thanks
Massimiliano

Thanks for your interest!

The following function gets called by WinMain (just: int WINAPI WinMain(…) { rcFilterEdgeTest(); return 0; }):

texture<uchar4, 2, cudaReadModeElementType> testtex;

void __cdecl rcFilterEdgeTest()
{
	const int w = 720;
	const int h = 576;

	// Source array the texture is bound to
	cudaArray* testarray;
	cudaChannelFormatDesc testformat = cudaCreateChannelDesc<uchar4>();
	if (cudaMallocArray(&testarray, &testformat, w, h) != cudaSuccess)
		return;

	// Output buffer, pre-filled with 128 so untouched pixels are recognizable
	void* testout;
	if (cudaMalloc(&testout, 4*w*h) != cudaSuccess)
		return;
	if (cudaMemset(testout, 128, 4*w*h) != cudaSuccess)
		return;

	if (cudaBindTexture(testtex, testarray) != cudaSuccess)
		return;
	testtex.normalized = 0;
	testtex.addressMode[0] = cudaAddressModeClamp;
	testtex.addressMode[1] = cudaAddressModeClamp;
	testtex.filterMode = cudaFilterModePoint;

	// One thread per pixel; w and h are multiples of 16
	dim3 threads(16, 16, 1);
	dim3 blocks(w / threads.x, h / threads.y, 1);
	kernelEdgeb3<<<blocks, threads>>>((uchar4*)testout, w*4);
	cudaUnbindTexture(testtex);

	// Read back the first three pixels for inspection in the debugger
	uchar4 result[3];
	if (cudaMemcpy(result, testout, 4*3, cudaMemcpyDeviceToHost) != cudaSuccess)
		return;
}

I used the debugger to check the values in “result”.

My kernel is slightly modified here; the problem remains the same:

void __global__ kernelEdgeb3(uchar4* out, int pitch)
{
	int y = blockDim.y * blockIdx.y + threadIdx.y;
	int x = blockDim.x * blockIdx.x + threadIdx.x;
	int sumx, sumy;
	uchar4 result;
	short3 sumX;
	short3 sumY;

	{
		// Corner samples, shared by both Sobel directions
		uchar4 rgb1 = texfetch(texb4_a, x-1, y-1);
		uchar4 rgb3 = texfetch(texb4_a, x+1, y-1);
		uchar4 rgb7 = texfetch(texb4_a, x-1, y+1);
		uchar4 rgb9 = texfetch(texb4_a, x+1, y+1);
		uchar4 rgbg1;
		uchar4 rgbg2;

		{
			// Horizontal response: rgbg1/rgbg2 hold the left/right neighbors
			rgbg1 = texfetch(texb4_a, x-1, y);
			rgbg2 = texfetch(texb4_a, x+1, y);
			sumX.x = (short)rgb1.x - rgb3.x + ((short)rgbg1.x - rgbg2.x)*2 + rgb7.x - rgb9.x;
			sumX.y = (short)rgb1.y - rgb3.y + ((short)rgbg1.y - rgbg2.y)*2 + rgb7.y - rgb9.y;
			sumX.z = (short)rgb1.z - rgb3.z + ((short)rgbg1.z - rgbg2.z)*2 + rgb7.z - rgb9.z;
		}
		{
			// Vertical response: rgbg1/rgbg2 now hold the top/bottom neighbors
			rgbg1 = texfetch(texb4_a, x, y-1);
			rgbg2 = texfetch(texb4_a, x, y+1);
			sumY.x = (short)rgb1.x - rgb7.x + ((short)rgbg1.x - rgbg2.x)*2 + rgb3.x - rgb9.x;
			sumY.y = (short)rgb1.y - rgb7.y + ((short)rgbg1.y - rgbg2.y)*2 + rgb3.y - rgb9.y;
			//sumY.z = (short)rgb1.z - rgb7.z + ((short)rgbg1.z - rgbg2.z)*2 + rgb3.z - rgb9.z;
		}
	}

	// Gradient magnitude: max of the per-channel responses (note that sumY.z
	// is read uninitialized while its assignment above is commented out)
	sumx = max(max(abs(sumX.x), abs(sumX.y)), abs(sumX.z));
	sumy = max(max(abs(sumY.x), abs(sumY.y)), abs(sumY.z));
	result.x = min((short)hypotf(sumx, sumy), 255);
	result.y = (unsigned char)((int)(atanf((float)sumx / fmaxf((float)sumy, 0.0001f)) * DIR_MULT + 0.5f) + 127); // DIR_MULT is undefined here; see note below
	result.z = 0;
	result.w = 255;

	//return result

	// Step to byte row pitch*y of the pitched output buffer, then write column x
	out = (uchar4*)( ((uchar1*)out) + pitch*y );
	out[x] = result;
}

When I uncomment the sumY.z assignment, the kernel returns without changing the output buffer. You can ignore the line with the undefined DIR_MULT, or just set result.y to zero; it doesn’t change anything.

Since my test code is still inside my big project, there are several includes, e.g. DirectShow. But no DirectShow function gets executed (I suspected CoInitialize() or something similar might be causing problems, but I commented all of that out). And DirectShow can’t be included by sources compiled with CUDA anyway…

Is it safer to use the low-level API, especially when launching kernels?
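For clarity, by the low-level API I mean launching through the driver interface, roughly like this sketch (the cubin file name and the unmangled entry name are just placeholders I made up, and all error checking is omitted):

	#include <cuda.h>

	// Minimal driver-API launch sketch; "edge.cubin" and the entry
	// name are placeholders, not names from my actual project.
	CUdevice   dev;
	CUcontext  ctx;
	CUmodule   mod;
	CUfunction fn;
	cuInit(0);
	cuDeviceGet(&dev, 0);
	cuCtxCreate(&ctx, 0, dev);
	cuModuleLoad(&mod, "edge.cubin");
	cuModuleGetFunction(&fn, mod, "kernelEdgeb3");

	// Push the parameters (output pointer, pitch), then launch a w/16 x h/16 grid
	int offset = 0;
	cuParamSetv(fn, offset, &testout, sizeof(void*)); offset += sizeof(void*);
	cuParamSeti(fn, offset, w*4);                     offset += sizeof(int);
	cuParamSetSize(fn, offset);
	cuFuncSetBlockShape(fn, 16, 16, 1);
	cuLaunchGrid(fn, w/16, h/16);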

Thanks again for your interest!

weigo

Hi again!

I wrote another kernel, which brings up different errors.

void __global__ kernelTest(uchar4* out, int pitch)
{
	int y = blockDim.y * blockIdx.y + threadIdx.y;
	int x = blockDim.x * blockIdx.x + threadIdx.x;
	out = (uchar4*)( ((uchar1*)out) + pitch*y );

	uchar4 rgb[8]; // neighborhood without the center pixel
	rgb[0] = texfetch(texb4_a, x-1, y-1);
	rgb[1] = texfetch(texb4_a, x  , y-1);

	// Pick the sample with the larger red channel
	int bestfit = 0;
	if (rgb[0].x > rgb[1].x)
	{
		bestfit = 1;
	}
	out[x] = rgb[bestfit];
}

This results in an “unspecified driver error” on the next call to cudaMemcpyToArray. If I set bestfit to 1 outside the conditional statement, the kernel works fine. If I don’t use bestfit to index the array but instead set the output color directly, the kernel also works fine (see the sketch below).
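That last workaround looks like this: select the fetched value directly instead of indexing the local array with a runtime-computed value.

	// Works: choose the output value directly instead of indexing
	// the local rgb[] array with the runtime-computed bestfit
	out[x] = (rgb[0].x > rgb[1].x) ? rgb[1] : rgb[0];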

Slowly but steadily, I’m coming to believe my system is broken.

Another question: Is it hazardous to link more than one object file generated by nvcc?

TIA,

weigo

I tried to run in emulation mode. The first example I posted throws an exception in kernelEdgeb3. The debugger was not able to track the current execution position: it switched to the file where the filter kernel is defined, but pointed to a blank line. Two execution steps later, the exception arises.

The obvious and only solution: smash the whole computer. I will try it and post the results.
That aside, what else can I do to find the source of all evil?

We were able to replicate the bug; I will enter it in the bug-tracking system.

Which of the kernels have you tested?