When I run your code as posted on CUDA 11.6 or CUDA 11.8, I get a report of 81MB, not 390MB.
CUDA uses lazy initialization, so you may not have CUDA fully initialized at the first call to cudaMemGetInfo. Then when you make the 2nd call, the reported number will include some CUDA overhead.
However, I don’t know of anywhere that claims destroying a handle will release all library overhead. So I’m pretty confident this is not a bug.
For example, when CUDA loads a library like cusolver, it loads all the kernels in the cusolver library. Destroying a handle doesn’t unload all these kernels.
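To separate the two effects described above (lazy initialization versus library load overhead), here is a minimal sketch of how I would measure it. It is not your code: it assumes the dense cuSOLVER handle API (cusolverDnCreate / cusolverDnDestroy), uses cudaFree(0) as a common idiom to force context creation before the first measurement, and the helper names to_mib and free_bytes are made up for illustration. Keep in mind that cudaMemGetInfo reports device-wide free memory, so other processes using the GPU will disturb the numbers.

```cuda
// Sketch: measure free device memory around a cuSOLVER handle's lifetime.
// Assumed build line: nvcc -o meminfo meminfo.cu -lcusolver
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Hypothetical helper: convert a byte count to MiB for printing.
static double to_mib(size_t bytes) { return bytes / (1024.0 * 1024.0); }

// Hypothetical helper: return free device memory in bytes, exit on error.
static size_t free_bytes() {
    size_t free_b = 0, total_b = 0;
    cudaError_t err = cudaMemGetInfo(&free_b, &total_b);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    return free_b;
}

int main() {
    // Force context creation first, so initialization overhead is not
    // attributed to the library in the measurements below.
    cudaFree(0);

    size_t before = free_bytes();

    cusolverDnHandle_t handle;
    if (cusolverDnCreate(&handle) != CUSOLVER_STATUS_SUCCESS) {
        fprintf(stderr, "cusolverDnCreate failed\n");
        return EXIT_FAILURE;
    }
    size_t with_handle = free_bytes();

    cusolverDnDestroy(handle);
    size_t after_destroy = free_bytes();

    // Differences are computed in floating point to avoid unsigned wraparound
    // if free memory happens to grow between measurements.
    printf("used by handle create:   %.1f MiB\n",
           to_mib(before) - to_mib(with_handle));
    printf("still used after destroy: %.1f MiB\n",
           to_mib(before) - to_mib(after_destroy));
    return 0;
}
```

If the second number is nonzero, that is the library overhead (loaded kernels and so on) that the handle destroy does not release. Running the same binary with and without CUDA_MODULE_LOADING=LAZY in the environment should show whether lazy module loading changes it on your setup.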
If you’d like to see a change in CUDA behavior, you can always file a bug. You may also want to investigate the opt-in “lazy” module loading available in CUDA 11.7 and 11.8. This will likely reduce the memory footprint.
To try it, run your application with the following env var set:
CUDA_MODULE_LOADING=LAZY
using CUDA 11.7 or 11.8. However, as I reported, when I test the code you have posted here, I get 81MB, not 390MB, and this switch has no effect on that observation.