Can I call cuModuleLoadData in a non-blocking way?

As shown in the picture above, the cuModuleLoadData call keeps waiting for the kernel execution to finish, and I don’t know why.

A module load will affect the device memory map of the GPU, so a device synchronization wouldn’t surprise me.

I don’t know of any way to call it in a non-blocking fashion. The usual suggestion is to perform such operations early in your code, and to avoid calling them during carefully structured work-issuance loops.

If that’s not possible for some reason, I don’t have any further suggestions. You can always file a bug requesting an enhancement to CUDA. If you file a bug, you may be asked for a complete test case/demonstrator.