Question about CUDA kernels parallel execution

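My guess would be CUDA lazy module loading. (Also here) With lazy loading, a kernel is only loaded on its first launch, and that load can keep the first launches from overlapping.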

Try re-running your test case with:

CUDA_MODULE_LOADING=EAGER nsys nvprof --print-gpu-trace kernel-concurrent
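
If eager loading makes the kernels overlap, another option is to launch each kernel once as a warm-up before the launches you care about, so any lazy load happens up front. Here is a minimal sketch of that idea. The kernels `spinA` and `spinB`, the helper `runConcurrent`, and the cycle count are all made up for illustration and stand in for whatever your test case actually runs:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernels: each one just spins for roughly `cycles` clock ticks.
__global__ void spinA(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

__global__ void spinB(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Hypothetical helper: warm up both kernels, then launch them in separate streams.
cudaError_t runConcurrent() {
    cudaError_t err;

    // Warm-up launches: with lazy loading, the first launch of a kernel
    // triggers its load, so do that here, before the launches being profiled.
    spinA<<<1, 1>>>(0);
    spinB<<<1, 1>>>(0);
    if ((err = cudaDeviceSynchronize()) != cudaSuccess) return err;

    cudaStream_t s1, s2;
    if ((err = cudaStreamCreate(&s1)) != cudaSuccess) return err;
    if ((err = cudaStreamCreate(&s2)) != cudaSuccess) {
        cudaStreamDestroy(s1);
        return err;
    }

    // Launches in different non-default streams are allowed to overlap.
    const long long cycles = 100000000LL;  // arbitrary value, assumed for the sketch
    spinA<<<1, 1, 0, s1>>>(cycles);
    spinB<<<1, 1, 0, s2>>>(cycles);

    err = cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return err;
}

int main() {
    cudaError_t err = runConcurrent();
    printf("%s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

In the GPU trace, the two timed launches should then show overlapping start/end times if the device can run them concurrently. The warm-up launches will also appear in the trace, so ignore those.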