My program uses both the runtime API (for embedded kernels written in CUDA C) and the driver API (to handle a cubin built from hand-written PTX). The standalone cubin is loaded in a class constructor, the kernel is invoked repeatedly in a member function, and the module is unloaded in the destructor, so the life cycle seems fine. When the program is built without optimization it runs perfectly, but once optimization is turned on, the kernel in the standalone cubin fails silently: all outputs are zeros, yet every CUresult returns CUDA_SUCCESS. If the module is reloaded every time the kernel is invoked, it works again.
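To make the structure clearer without posting the real code, here is a minimal sketch of the pattern I described. All names (`PtxKernel`, `"kernel.cubin"`, `"myKernel"`) are placeholders, not my actual code, and every driver-API call is checked, since a silent failure that still reports CUDA_SUCCESS is exactly what I'm seeing:

```cpp
#include <cuda.h>
#include <cstdio>

// Report (but don't abort on) any non-success CUresult.
#define CU_CHECK(call)                                            \
    do {                                                          \
        CUresult err_ = (call);                                   \
        if (err_ != CUDA_SUCCESS) {                               \
            const char *msg_ = nullptr;                           \
            cuGetErrorString(err_, &msg_);                        \
            fprintf(stderr, "%s failed: %s\n", #call, msg_);      \
        }                                                         \
    } while (0)

class PtxKernel {
public:
    PtxKernel() {
        // Constructor: load the standalone cubin once.
        CU_CHECK(cuModuleLoad(&module_, "kernel.cubin"));
        CU_CHECK(cuModuleGetFunction(&func_, module_, "myKernel"));
    }
    void run(CUdeviceptr out, int n) {
        // Member function: launched repeatedly over the object's lifetime.
        void *args[] = { &out, &n };
        CU_CHECK(cuLaunchKernel(func_, (n + 255) / 256, 1, 1,
                                256, 1, 1, 0, nullptr, args, nullptr));
        CU_CHECK(cuCtxSynchronize());
    }
    ~PtxKernel() {
        // Destructor: unload the module at the end of the life cycle.
        CU_CHECK(cuModuleUnload(module_));
    }
private:
    CUmodule   module_ = nullptr;
    CUfunction func_   = nullptr;
};
```

In the failing build, none of these checks fire; the launch and synchronize both report success while the output stays all zeros.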
I've checked that, right before invoking the kernel, the module is still alive and both the function handle and the global-variable address can be retrieved successfully. The context is also alive and unchanged during the whole life cycle of the program, so I suspect a problem with optimization and the driver API working together. The program is built with CUDA 10.0 on a GTX 980 Ti under Win7. The same program compiles and runs on another machine with an RTX 2080 Ti on Win10 (with optimization on), although sometimes it also mysteriously fails there when some irrelevant code is modified. This is not sporadic: once the program is compiled, it behaves the same way repeatedly, either always working or always failing.
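For reference, these are the "still alive" checks I run right before each launch (again with placeholder names such as `"myGlobal"`). All of them pass even in the failing build:

```cpp
#include <cuda.h>
#include <cassert>

// Sanity checks performed immediately before each kernel launch.
void checkBeforeLaunch(CUmodule module) {
    // The function handle can still be retrieved from the module.
    CUfunction func = nullptr;
    assert(cuModuleGetFunction(&func, module, "myKernel") == CUDA_SUCCESS);

    // The global variable address is still resolvable.
    CUdeviceptr gptr = 0;
    size_t gsize = 0;
    assert(cuModuleGetGlobal(&gptr, &gsize, module, "myGlobal") == CUDA_SUCCESS);

    // The current context is non-null and unchanged since construction.
    CUcontext ctx = nullptr;
    assert(cuCtxGetCurrent(&ctx) == CUDA_SUCCESS && ctx != nullptr);
}
```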
BTW, I can't debug into the kernel with Nsight (legacy mode; Next-Gen doesn't seem to work on Win7); it just freezes my system without any response.
I'm sorry I can't post the actual code here, so the problem may be quite difficult to locate. Probably I didn't use the driver API correctly, or there is something I'm missing…
Any hints or similar experiences would be helpful~ Thanks very much!