As shown in the picture above, the cuModuleLoadData call waits for the currently executing kernel to finish, and I don't know why.
A module load modifies the device memory map of the GPU, so a device synchronization wouldn't surprise me.
I don't know of any way to call it in a non-blocking fashion. The usual suggestion is to perform such operations early in your code, and to avoid them inside carefully structured work-issuance loops.
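To illustrate, a minimal driver-API sketch of that structure might look like the following. This is an assumption about your code layout, not a known-working program: the PTX image, the kernel name "myKernel", and the error handling are all placeholders, and the point is only where the module load sits relative to the launch loop.

```cuda
#include <cuda.h>

int main(void) {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Do all module loads during initialization, before any kernels are
    // in flight. cuModuleLoadData can synchronize the device (it changes
    // the device memory map), so paying that cost here is harmless.
    const char *ptx = /* PTX image: embedded string or loaded from file */ 0;
    CUmodule mod;
    cuModuleLoadData(&mod, ptx);

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "myKernel");  // hypothetical kernel name

    // Work-issuance loop: only cuLaunchKernel calls and async copies here.
    // No module loads inside the loop, so nothing forces an unexpected
    // device-wide synchronization mid-stream.

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

The same idea applies to other context-modifying operations: front-load them, then keep the hot loop restricted to launch and async-copy calls.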
If that's not possible for some reason, I don't have any further suggestions. You can always file a bug requesting an enhancement to CUDA; if you do, you may be asked for a complete test case/demonstrator.