CUDA has turned into CRAP (cuModuleLoad fails, bug in compiler/driver ?)


CUDA Driver API cuModuleLoad fails as shown in this video:

cuModuleLoad raises a floating point exception which prevents my application from running.

What is even weirder is that, depending on which CUDA toolkit compiler version and settings were used, the 32-bit floating point version might or might not load and run, and the same goes for the 64-bit floating point version.

The following WinRAR archive contains two versions of the kernel (32-bit and 64-bit floating point) and matching executables (Float vs Double):

If compiled with CUDA toolkit 4.2 in debug mode, the 32-bit floating point kernel will load and run.

If compiled with CUDA toolkit 6.0 in release mode, the 64-bit floating point kernel will load and run.

All other combinations seem to fail.

Try compiling the kernel into a ptx file on your own system and see the results for yourself!

If it runs, it would be nice to get a video of it as proof, just for the fun of it!


I would suggest making the most isolated case possible. As can be seen by your video, there is quite a lot of external code that goes into calling the kernel, and the bug is probably elsewhere.

For what it’s worth, I compiled your kernel with nvcc -arch=sm_35 --ptx (with and without the -G flag).

With the -G flag I get an access violation in nvcuda.dll, a division by zero error, the app opens to a black screen, and some CUDA_SUCCESS messages, although clearly nothing is working.

Without the -G flag I get a green screen and a bunch of CUDA launch failures.

The division by zero occurs because no kernel is loaded.

The code assumes the kernel was loaded and carries on regardless.

If the kernel loads properly, there is no division by zero.
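A minimal sketch of the kind of guard that would avoid this crash, assuming the division is something like a grid-size calculation whose divisor stays zero when the module load fails. KernelState and blocks_needed are hypothetical names for illustration, not taken from the test application:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the application's state after it has tried
   to load the module. Not part of the CUDA API. */
typedef struct {
    int load_ok;                /* nonzero only if the module load reported success */
    unsigned threads_per_block; /* assumed to stay 0 when the load failed */
} KernelState;

/* Returns the number of blocks needed to cover work_items,
   or 0 if the kernel is not usable. */
unsigned blocks_needed(const KernelState *k, unsigned work_items)
{
    assert(k != NULL);
    if (!k->load_ok || k->threads_per_block == 0) {
        /* Refuse to divide; the caller should report the load error instead. */
        return 0;
    }
    /* Round up so a partial last block is still counted. */
    return (work_items + k->threads_per_block - 1) / k->threads_per_block;
}
```

With a guard like this, a failed load shows up as "nothing to launch" rather than as a division by zero that hides the original error.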

I am debugging the Delphi code right now… it seems mHandle is nil in the call to cuModuleLoad… that’s clearly a problem… investigating.

mCudaErrorCode := cuDeviceGet( mHandle, mNumber );

^ This API call is supposed to set the mHandle to something.

It’s returning nil/0 if mNumber is zero ? WTF ?!

According to the cuda.h header, cuDeviceGet returns handles within the legal range 0 to N-1.

So a returned handle of zero seems to be valid, and that’s probably not the problem.

Thus I now believe the problem is caused by the CUDA compiler generating the ptx instructions and/or the CUDA driver loading them.
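To illustrate why a zero handle is not an error by itself, here is a small sketch that judges success by the returned status code rather than by the handle value. fake_device_get is a made-up stand-in that only mimics the 0 to N-1 behaviour described in cuda.h; it is not the real driver call, and FakeResult, FakeDevice and device_is_usable are likewise hypothetical:

```c
#include <assert.h>

/* Stand-ins only: the real CUresult and CUdevice types live in cuda.h. */
typedef int FakeResult; /* 0 plays the role of CUDA_SUCCESS */
typedef int FakeDevice; /* valid handles are 0 .. device_count-1 */

enum { FAKE_SUCCESS = 0, FAKE_ERROR_INVALID_DEVICE = 1 };

/* Mimics the documented contract: an ordinal in range yields a handle
   in the same range, anything else yields an error code. */
FakeResult fake_device_get(FakeDevice *device, int ordinal, int device_count)
{
    assert(device != 0);
    if (ordinal < 0 || ordinal >= device_count)
        return FAKE_ERROR_INVALID_DEVICE;
    *device = ordinal; /* the first device legitimately gets handle 0 */
    return FAKE_SUCCESS;
}

/* Decides usability from the status code; the handle value is not inspected. */
int device_is_usable(int ordinal, int device_count)
{
    FakeDevice dev = -1;
    FakeResult rc = fake_device_get(&dev, ordinal, device_count);
    return rc == FAKE_SUCCESS;
}
```

On a single-device system, device_is_usable(0, 1) reports success even though the handle is 0, while device_is_usable(1, 1) fails because the ordinal is out of range.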

Perhaps it’s a “just-in-time bug” lol, like a “just-in-time comepiler” :)

Testing ptx instructions, floats, doubles, and just-in-time compiling inside the driver is well beyond “my scope”. I have no tools available to investigate problems like these.

Perhaps nvidia also has no or limited tools available to diagnose these kinds of problems.

Any help, suggestions, or tools for debugging “ptx issues” and/or “just-in-time compiler/driver bugs” are welcome.

A small, possibly insignificant update to this problem:

The test application creates a special CUDA context with cuGLCtxCreate instead of cuCtxCreate. cuGLCtxCreate has been deprecated; I am not sure whether using this special context is causing the problem.