Weird error

The following kernel does some calculations and stores it in an output buffer previously allocated on the device. You don’t have to keep track of what’s being calculated, you should just recognize, that at the end of the kernel “threadIdx.y” should set some parts of the output buffer to non-zero values. But the kernel leaves the output memory completely unchanged (only zeros as previously set by cudaMemset). If you replace “uv[0].x” in the last line of the kernel by 0, the resulting array “result” contains the correct index values set by threadIdx.y in the z component.

Also other parts of the kernel can be commented out or changed to get it working. For example: removing " + jump.y * pos.x" from the computation in the inner for-loop makes the kernel behave as expected.

Here’s my host function:

static const int CENTROID_BUFFER_SIZE = 3 * 3 * 50 * sizeof(float3);

#define VHULL_SCALE 2.0f



dim3 blocks(1, 1, 1);

dim3 threads(3, 50, 3);

cudaMalloc((void**)&centroids, CENTROID_BUFFER_SIZE);

cudaMemset(centroids, 0, CENTROID_BUFFER_SIZE);

kernelCentroid<<<blocks, threads>>>(centroids);

float3 result[CENTROID_BUFFER_SIZE / sizeof(float3)];

cudaMemcpy(result, centroids, CENTROID_BUFFER_SIZE, cudaMemcpyDeviceToHost);


Here’s my kernel:

void __global__ kernelCentroid(float3* out)


	int x = threadIdx.x;

	int y = blockDim.y * blockIdx.y + threadIdx.y;

	int z = threadIdx.z;

	out += blockDim.x * blockDim.z * y + blockDim.x * z + x;

	__shared__ float3 spanPerThread;

	__shared__ float2 jump;

	__shared__ float offset;

	if (y == 0)


  spanPerThread = make_float3(VHULL_SCALE / blockDim.x, 

  	VHULL_SCALE / (blockDim.y * gridDim.y), 

  	VHULL_SCALE / blockDim.z);


  offset = -VHULL_SCALE * 0.5f;

  jump = make_float2(spanPerThread.x, spanPerThread.z);


	__syncthreads(); // wait for shared data to be written

	float3 spos = make_float3(spanPerThread.x * x + offset, spanPerThread.y * y, spanPerThread.z * z + offset);

	float2 uv[4];

	for (int iz=0; iz < 20; iz++)



  float3 pos = make_float3(spos.x + jump.x * x, spos.y, spos.z + jump.y * z);

  for (int i=0; i < 4; i++)



  	uv[i] = make_float2(offset*pos.z, offset*pos.y + jump.y * pos.x);



	(*out) = make_float3(uv[0].x, 0, threadIdx.y);


My system:

Intel Core Duo 6600

Windows XP Service Pack 2

CUDA 1.0 (Toolkit and SDK for WinXP 32bit)

Display Driver version 162.01

I’m using Visual Studio 2005 Pro and that’s my command line:

C:\CUDA\bin\nvcc.exe -ccbin “D:\Programme\Microsoft Visual Studio 8\VC\bin” -c -DWIN32 -D_DEBUG -D_MBCS -Xcompiler /Gm,/RTC1,/W3,/nologo,/Wp64,/Od,/Zi,/MTd -I"C:\CUDA\include" -I./ -I./include -I"C:\CUDA/common/inc" -o Debug\visualhull_cuda.obj d:\RIP\src\

The object file gets linked to other object files (some of them also compiled with nvcc).

I checked for a error right after the kernel invocation with:

cudaError_t err = cudaThreadSynchronize();

if(cudaSuccess != err)




But no error is reported.

My graphics card is a 8800 GTX.

In device emulation mode, the code works fine, although i could not walk through the device code with the debugger.

You have 450 threads in a block. While this is under 512, perhaps your kernel uses too many registers to reach that size. Check the cubin output to see how many registers you are using.

Of course, this should be reported as an error that you say you don’t get. Try calling cudaThreadSynchronize() and then cudaGetLastError(). It has been said by others on the forums that cudaThreadSynchronize() doesn’t return errors all the time.

Lastly, you can step through device code when compiled in emulation mode. You just can’t step INTO the kernel call. Set a breakpoint inside the device code and everything should work.

I will try to reduce the number of threads and/or check the cubin on Monday as well as try to catch an error with cudaGetLastError().

I thought stepping through kernel code (which i meant when talking about device code) is the main advantage of using device emulation. Of course i can debug non-kernel code compiled with CUDA, but this is also possible when compiling without device emulation.

Thank you for your effort.

Sorry if my above statements were confusing. You CAN step through kernel code in emulation mode. I do it all the time when debugging. Set a breakpoint inside the kernel, and it should work.

You were right! The register count is 21, which is too much. I halfed the number of threads per block, now it works, but the occupancy is at 33%.

I have to find a way to reduce the register count…

Thanks a lot!