cuda-gdb and cudaThreadSynchronize()

I’m encountering an “unspecified launch error” upon calling cudaThreadSynchronize(), but only when running my test application under cuda-gdb…

The application launches a sequence of kernel functions synchronously to process one 512x256 32-bit frame. It executes the sequence of kernel functions iteratively to process multiple frames. Each iteration ends with a call to cudaThreadSynchronize(), after which the results are copied back from global memory to the host via cudaMemcpy() calls and verified. The application employs page-locked host buffers for I/O, but calls the blocking cudaMemcpy() to transfer the data.
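For reference, here is a minimal sketch of the loop structure described above. It is not the actual application code: the kernel `stage`, the buffer names, the `CHECK` macro, and the frame count are hypothetical stand-ins, and a single trivial kernel takes the place of the real sequence. It checks the error state after the launch and again after the synchronize, which should help narrow down which call first reports a failure.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for one stage of the real kernel sequence.
__global__ void stage(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1u;
}

int main()
{
    const int n = 512 * 256;                       // one 512x256 frame, 32 bits per element
    const size_t bytes = n * sizeof(unsigned int);
    const int numFrames = 4;                       // assumed iteration count

    unsigned int *hIn, *hOut, *dIn, *dOut;
    CHECK(cudaMallocHost((void **)&hIn, bytes));   // page-locked host buffers
    CHECK(cudaMallocHost((void **)&hOut, bytes));
    CHECK(cudaMalloc((void **)&dIn, bytes));
    CHECK(cudaMalloc((void **)&dOut, bytes));

    for (int frame = 0; frame < numFrames; ++frame) {
        for (int i = 0; i < n; ++i) hIn[i] = (unsigned int)(i + frame);
        CHECK(cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice));

        stage<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
        CHECK(cudaGetLastError());       // reports launch-time errors
        // cudaThreadSynchronize() matches the CUDA 2.x API used in this post;
        // later toolkits name this call cudaDeviceSynchronize().
        CHECK(cudaThreadSynchronize());  // reports errors raised during execution

        CHECK(cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost));
        for (int i = 0; i < n; ++i) {
            if (hOut[i] != hIn[i] + 1u) {
                fprintf(stderr, "frame %d: mismatch at element %d\n", frame, i);
                return EXIT_FAILURE;
            }
        }
    }

    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return EXIT_SUCCESS;
}
```

With a real multi-kernel sequence, placing the same pair of checks after each launch (at least while debugging) would show which kernel the error originates from, since an asynchronous failure is otherwise only reported at the next synchronizing call.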

When launched from the command-line, execution of multiple iterations is successful in emulator, release, and debug modes. But, when executed under cuda-gdb, the second iteration fails with an “unspecified launch error” upon calling cudaThreadSynchronize(). The first iteration completes successfully, but runs dog-slow compared to running the debug-built application from the command-line (outside of cuda-gdb).

My platform is RHEL 5.2, GeForce 8600 GTS with 256 MB global memory, CUDA 2.1. No X11 is installed on the host.

Does anyone have an inkling of what might be going on? I was wondering whether kernel timeouts are possible on Linux, even without X11.

Thank you in advance for your insight…

Can you post a repro case for this problem?