"unspecified launch error" when using doubles

I started using doubles instead of floats in my program, and now device synchronization fails with an "unspecified launch failure." I figured this had something to do with pointers and doubles having different sizes, so I switched to a 64-bit build, but that didn't help either. Now the device either hangs (until it resets after 10 seconds; I modified the registry to set that timeout) or synchronization still fails with the same error.

Does anyone have any tips?

EDIT: Could it be due to the fact that I am still mixing in 32-bit integers in some classes?
EDIT 2: Nope.

Not a lot of information given, e.g.:

What OS are you on?
What NVIDIA driver revision?
And what's the NVIDIA hardware?
What CUDA SDK was used, and what options were given to nvcc?

OS: Windows 8.1 Pro, 64-bit
Driver: 347.62
Hardware: GeForce GT 750M
SDK: 7.0
Options: compute_30,sm_30

EDIT: Could it have something to do with the fact that I’m mixing references and pointers? (Because the size of a reference is still 4 in 64-bit builds.)

Without knowledge of the code, it will be difficult to pinpoint the root cause.

Does your code diligently check the return status of all CUDA and CUDA library API calls, as well as all kernel launches? There could be a failing GPU memory allocation, for example, since switching from float to double doubles the size of every allocated array, matrix, etc. Try running the app under cuda-memcheck, which can check for out-of-bounds accesses, race conditions, and API errors.
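
For concreteness, this is the kind of checking I mean; a minimal sketch only, with a placeholder kernel and macro name rather than anything from your code:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Illustrative helper: abort with a message if a CUDA runtime call fails.
    #define CUDA_CHECK(call)                                                  \
        do {                                                                  \
            cudaError_t err_ = (call);                                        \
            if (err_ != cudaSuccess) {                                        \
                fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",               \
                        cudaGetErrorString(err_), __FILE__, __LINE__);        \
                exit(EXIT_FAILURE);                                           \
            }                                                                 \
        } while (0)

    // Trivial double-precision kernel standing in for the real ones.
    __global__ void scale(double *x, double factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        double *d_x = nullptr;

        // Allocations can start failing once every array doubles in size.
        CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(double)));
        CUDA_CHECK(cudaMemset(d_x, 0, n * sizeof(double)));

        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0, n);
        CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
        CUDA_CHECK(cudaDeviceSynchronize());  // errors during kernel execution

        CUDA_CHECK(cudaFree(d_x));
        return 0;
    }

Running the resulting executable under cuda-memcheck (e.g. cuda-memcheck my_app.exe, or with --tool racecheck for the race detector) will then point at the out-of-bounds access or race that a plain "unspecified launch failure" hides.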

sm_30 is an architecture with low throughput of double-precision computation compared with single-precision computation. Kernel run times may increase by 10x when you switch from computation on floats to computation on doubles. This may cause a kernel to trigger the operating system’s watchdog timer limit (usually a couple of seconds) and be terminated. You can use the CUDA profiler to determine the run-time of individual kernels.
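
If you don't want to set up the profiler, bracketing a kernel launch with CUDA events gives a quick run-time estimate to compare against the watchdog limit; again, just a sketch with a placeholder kernel:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(double *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 1.000001 + 0.5;   // arbitrary double-precision work
    }

    int main()
    {
        const int n = 1 << 20;
        double *d_x;
        cudaMalloc(&d_x, n * sizeof(double));
        cudaMemset(d_x, 0, n * sizeof(double));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        work<<<(n + 255) / 256, 256>>>(d_x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
        printf("kernel time: %.3f ms\n", ms);    // compare against the ~2 s default watchdog

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_x);
        return 0;
    }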

Yes, I check the return code of every single CUDA library call I make. I knew about the doubled memory footprint, but I can try running it with cuda-memcheck. If the performance hit is as bad as you say, then I think I won't even bother. I may have found a solution to the problem I was having earlier, so I may not need double precision anymore. If it turns out I do, then I will use cuda-memcheck and report back in this thread.

Thank you!