My application stops working after upgrading CUDA. Tracing it in a debugger shows that it hangs when the first kernel is launched.
Everything works fine on 12.1. What is the best way to report this bug?
Note: I did a fresh install of Ubuntu 22.04 and CUDA 12.2, and the app still hangs on the first kernel.
You might be hitting this. In a nutshell, try running your app on CUDA 12.2 with
CUDA_MODULE_LOADING=EAGER ./my_app
to see if the behavior changes. (Replace my_app above with the name of your actual compiled executable.)
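If the hang makes it awkward to run this test by hand, a small wrapper with a timeout can help. The following is only a sketch: the function name `run_with_module_loading` and the 60-second default are assumptions made for illustration, and `./my_app` is the same placeholder as above.

```python
import os
import subprocess


def run_with_module_loading(cmd, mode, timeout_s=60):
    """Run cmd with CUDA_MODULE_LOADING set to mode.

    Returns the process exit code, or the string "hang" if the process
    did not finish within timeout_s seconds.
    """
    # Copy the current environment and override only the one variable.
    env = dict(os.environ, CUDA_MODULE_LOADING=mode)
    try:
        completed = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return "hang"
    return completed.returncode


# Example (placeholder executable name):
#   print(run_with_module_loading(["./my_app"], "EAGER"))
```

A result of "hang" only means the app did not finish within the timeout, so pick a timeout comfortably longer than a normal run. A process stuck inside the driver may also resist being killed, in which case the wrapper itself can block.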
Anyone who wants to can file a bug with NVIDIA using the bug reporting portal. The instructions are linked from a sticky post at the top of this sub-forum. However, what you have here is not sufficient for a bug report. At a minimum, if you filed such a bug, the QA team would ask you for a minimal reproducer (short but complete code that demonstrates the issue) along with other details such as your compile command line, the GPU you are running on, and maybe other things.
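Some of those details (GPU, driver version, toolkit version) can be gathered with a short script. This is a rough sketch, not an official checklist: the helper names `tool_output` and `collect_report_details` are made up for illustration, and it assumes `nvidia-smi` and `nvcc` are on your PATH (it reports them as missing otherwise).

```python
import shutil
import subprocess


def tool_output(cmd):
    """Return the output of cmd, or a short note if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{cmd[0]}: failed ({exc})"
    # Fall back to stderr when the tool printed nothing on stdout.
    return (result.stdout or result.stderr).strip()


def collect_report_details():
    """Collect GPU name, driver version and toolkit version as plain strings."""
    return {
        "gpu_and_driver": tool_output(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
        ),
        "toolkit": tool_output(["nvcc", "--version"]),
    }


# Example:
#   for key, value in collect_report_details().items():
#       print(f"{key}:\n{value}\n")
```

You would still need to add the reproducer itself and the exact compile command line by hand.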
If you intend to do that (you can do as you wish, of course), you also have the option to post such a short, complete example and instructions here, and the community will generally have a look at it. They may spot something. If that were the case, it would be a better path than filing a bug, for a number of reasons.
Thanks, I will try this. Currently I have 12.1 installed. Can I just install 12.3 as well and test, or do I need to uninstall 12.1? Also, I notice that 12.3 now has the base installer separate from the driver install. Do I need the 545 driver to run 12.3?
You can use 12.2 or 12.3 to test my theory. You can install 12.2 or 12.3 “alongside” 12.1. It’s not necessary to uninstall 12.1. As you are already aware, you will need a new enough driver to support whichever CUDA version you are using. For CUDA 12.3, yes, you would definitely need to install following the instructions on the download page.
I guess I should have also pointed out that you can test the “reverse” case on CUDA 12.1, by specifying
CUDA_MODULE_LOADING=LAZY ./my_app
If the application then hangs on CUDA 12.1 (as it does on CUDA 12.2 or newer), that would be an equivalent indication of the possible root cause. But it seems you have already sorted things out. For future readers, this “reverse” test case is only valid on CUDA 11.7 or newer.
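To summarize which of the two tests applies to which toolkit version, here is a small sketch. It assumes, as this thread implies, that lazy loading is the default from CUDA 12.2 onward (on Linux) and that the variable is only honored from CUDA 11.7. The function name `suggested_test` is hypothetical.

```python
def suggested_test(cuda_version):
    """Map a CUDA version string such as "12.1" to the CUDA_MODULE_LOADING
    value worth trying, or None if the variable is not supported.

    Assumption: lazy loading is the default from 12.2 on, so EAGER is the
    interesting override there; on 11.7 through 12.1, LAZY is the
    interesting override (the "reverse" test).
    """
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    if (major, minor) < (11, 7):
        return None  # CUDA_MODULE_LOADING is not available before 11.7
    if (major, minor) >= (12, 2):
        return "EAGER"
    return "LAZY"


# Example:
#   suggested_test("12.1")  # "LAZY"
#   suggested_test("12.3")  # "EAGER"
```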