Emulation run works, but GPU run fails

I compiled my code in two modes: emulation mode and release mode. I have two problems:

  1. The emulated build runs fine, but the program built in normal release mode fails to run: cutilCheckMsg() CUTIL CUDA error: … unspecified launch failure. I suspect there is a bug in my code, although I have checked it and found nothing. On the other hand, the code does run to completion in emulation mode.

  2. The results of the emulation run vary between kernel calls. In my code I simply call the same kernel several times in a row, roughly like this:

kernel <<< grid, threads >>>( … );

kernel <<< grid, threads >>>( … );

kernel <<< grid, threads >>>( … );

After executing the kernel I copy the device memory back to host memory, but the results differ from one kernel call to the next.

Could anyone please help me? Any suggestions are appreciated.

BTW, my card is a GTX 285, running under Fedora 8.

An unspecified launch failure usually means that your kernel is accessing memory it is not supposed to, typically by writing past the end of an array.
You can run your emulation build under valgrind to check for this. I believe there are even suppression files posted on the forum that make valgrind ignore some common false positives.
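
It can also help to narrow things down from inside the program. Below is a minimal sketch, not your actual code: the kernel `scale`, the macro `CHECK_CUDA`, and the helper `launch_three_times` are hypothetical names made up for illustration. It shows two habits that often catch this kind of bug: guarding the thread index so surplus threads in the last block do not write out of bounds, and checking for errors after every launch so you can see which of the repeated calls fails. It assumes a toolkit that has cudaDeviceSynchronize(); on older toolkits the equivalent call is cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the error and abort if a runtime call failed.
#define CHECK_CUDA(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernel: the guard stops threads with i >= n from writing
// past the end of the array.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Copies host data to the device, launches the kernel three times with an
// error check after each launch, and copies the result back.
void launch_three_times(float *host, int n)
{
    float *dev = NULL;
    CHECK_CUDA(cudaMalloc((void **)&dev, n * sizeof(float)));
    CHECK_CUDA(cudaMemcpy(dev, host, n * sizeof(float),
                          cudaMemcpyHostToDevice));

    dim3 threads(256);
    dim3 grid((n + threads.x - 1) / threads.x);  // round up to cover all n

    for (int launch = 0; launch < 3; ++launch) {
        scale<<<grid, threads>>>(dev, n, 2.0f);
        CHECK_CUDA(cudaGetLastError());       // bad launch configuration
        CHECK_CUDA(cudaDeviceSynchronize());  // faults while the kernel ran
    }

    CHECK_CUDA(cudaMemcpy(host, dev, n * sizeof(float),
                          cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaFree(dev));
}

int main()
{
    const int n = 1000;  // deliberately not a multiple of the block size
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = 1.0f;

    launch_three_times(host, n);
    printf("host[0] = %f, host[%d] = %f\n", host[0], n - 1, host[n - 1]);
    return 0;
}
```

Because the synchronize call waits for the kernel to finish, a fault is reported right after the launch that caused it instead of at a later memcpy. If the results still differ between calls with no error reported, the next things I would look at are buffers that are read before they are initialized and threads that write to the same location.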