cudaError_enum Strange kernel failure

Hello,

I have a reasonably basic kernel that multiplies two complex volumes, and it generates an exception:

First-chance exception at 0x7c812a5b in mshta.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0983ea00…

First-chance exception at 0x7c812a5b in mshta.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0983ea70…

First-chance exception at 0x7c812a5b in mshta.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0983fd80…

__global__ void volume_complex_conjugate_multiply(Complex* pVolOut, const Complex* pC1, const Complex* pC2,
                                                  const int maxindex)
{
    const int index = (blockIdx.y*gridDim.x*blockDim.x) + (blockIdx.x*blockDim.x) + threadIdx.x;

    if (index >= 0 && index < maxindex && (threadIdx.x < blockDim.x || blockIdx.x < gridDim.x ||
        blockIdx.y < gridDim.y))
    {
        Complex Res;
        Res.x = (pC1[index].x*pC2[index].x) + (pC1[index].y*pC2[index].y);
        Res.y = (pC1[index].x*pC2[index].y) - (pC2[index].x*pC1[index].y);
        pVolOut[index] = Res;
    }
}

I’ve tried to track down the problem and thought I had found it a few times, but it keeps coming back. The same error appears when I deliberately make a basic “array out of bounds” error, so I added multiple if-guards to prevent that and also checked in advance that none of the three pointers is NULL. All three pointers point to different volumes and of course their memory does not overlap. I have checked the error state after the kernel failure, but it’s just a (meaningless) cudaErrorLaunchFailure. The kernel is simply launched as follows:

threads[GPUId].x = 109;
threads[GPUId].y = 1;
threads[GPUId].z = 1;
blocks[GPUId].x = 216;
blocks[GPUId].y = 116;
blocks[GPUId].z = 1;

volume_complex_conjugate_multiply<<<blocks[GPUId], threads[GPUId]>>>
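To be concrete, the error check after the launch looks roughly like this (just a sketch; d_out, d_c1, d_c2 and n stand in for my actual device pointers and element count):

    volume_complex_conjugate_multiply<<<blocks[GPUId], threads[GPUId]>>>(d_out, d_c1, d_c2, n);

    // First catch launch-configuration errors, then wait for the kernel to finish;
    // this is where the failed launch shows up as cudaErrorLaunchFailure.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));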

This kernel is part of a larger multi-GPU application (8800 GTX and 8500 GT). For some reason the error only occurs on the 8500 GT. (I am aware of the large difference in performance between the two GPUs, but that’s part of the experiment.)

  • The execution time of the kernel falls well below the 5-second limit (so it’s not the watchdog timer).

  • NVCC reports that the kernel uses 7 registers (7*109 < 8192) and 32+28 bytes of smem (60*109 < 16k); see the ptxas output below.
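(For reference, those numbers come from compiling with verbose ptxas output; the command and the relevant line look roughly like this, where mykernels.cu just stands for the file containing this kernel:

    nvcc --ptxas-options=-v mykernels.cu
    ptxas info    : Used 7 registers, 32+28 bytes smem
)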

I have spent literally two days trying to find the source of this error and have searched the forum for similar errors, but I have not been able to solve it. If anybody has ANY idea what could cause this, please post it.

Thanks in advance,

Kevin

Have you tried checking the errors from ALL CUDA-related calls on the host side? Have you changed the hardware requirements (like the compute capability)?

Yes, I have a macro which checks the error state around every CUDA function (sketched below). What exactly do you mean by whether I changed the hardware requirements? How could I change the compute capability? Do you mean in the program, for instance by adding statements only supported by compute capability 1.1? As far as I know I am not using anything that is 1.1-only (no atomics).

Also, the 8500 GT (on which the error occurs) has a higher compute capability (1.1) than the 8800 GTX (1.0); could that be the source of the problem? It could be, but I would find it somewhat strange, especially since the error always occurs in the same kernel…
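For what it’s worth, the macro is basically the usual pattern; a minimal sketch (the real one is equivalent, the names here are just illustrative):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wraps every CUDA runtime call and bails out with file/line info on failure.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t e = (call);                                       \
            if (e != cudaSuccess) {                                       \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                        cudaGetErrorString(e), __FILE__, __LINE__);       \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // e.g. CUDA_CHECK(cudaMemcpy(d_c1, h_c1, nbytes, cudaMemcpyHostToDevice));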

Is there anybody who has any idea what could possibly go wrong in my code?

I’ve been getting exactly the same thing! I’ve been wrapping CUDA error checks around every call and also following every kernel call with cudaThreadSynchronize(). I haven’t tested your code yet, but when I DO get this error it SEEMS to still run fine. How about you?

Also, it only does this in debug mode (of course), and when I run without debugging, no problems.

When I get this error the thread in which it occurs exits immediately. Note, however, that the most likely cause of a cudaErrorLaunchFailure is something like indexing an array out of bounds (which I’m very sure I’m not doing). I also have another lead now (see http://forums.nvidia.com/index.php?showtopic=60956) which it could be related to as well, since I have not put cudaThreadSynchronize() after every CUDA call. I’m now experimenting with loops full of mallocs, frees and kernel launches to see what happens; a rough sketch is below. I have not tried running without debugging; I will try that soon…
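The test loop is essentially something like this (a rough sketch with a dummy allocation size; d_out, d_c1, d_c2 and n are again placeholders, and CUDA_CHECK is the macro sketched above):

    // Repeated allocate / launch / free, checking the error state every iteration.
    for (int i = 0; i < 1000; ++i)
    {
        Complex* d_tmp = 0;
        CUDA_CHECK(cudaMalloc((void**)&d_tmp, 1024 * sizeof(Complex)));

        volume_complex_conjugate_multiply<<<blocks[GPUId], threads[GPUId]>>>(d_out, d_c1, d_c2, n);
        CUDA_CHECK(cudaThreadSynchronize());

        CUDA_CHECK(cudaFree(d_tmp));
    }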

I think you can get this in Visual Studio when you have it set to report or break on C++ exceptions; the first-chance messages are shown whenever an exception is thrown, even if it is handled afterwards.
Hope that helps anyone.