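My guess would be CUDA lazy module loading: with lazy loading, a kernel is only loaded on its first launch, and that load can keep kernels from overlapping the first time they run. (Also here)
Try re-running your test case with eager loading: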
CUDA_MODULE_LOADING=EAGER nsys nvprof --print-gpu-trace kernel-concurrent
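If you would rather not depend on the environment variable, another option is to launch each kernel once before the part you profile, so the load has already happened by then. A rough sketch, assuming a made-up `spin` kernel standing in for yours (the stream count and cycle count are also just placeholders, not taken from your test case):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernels in the test case: busy-waits for
// roughly `cycles` GPU clock cycles so that any overlap is visible in a trace.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int kStreams = 4;  // assumption: any small number of streams works
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) {
        cudaStreamCreate(&streams[i]);
    }

    // Warm-up launch: under lazy loading, the first launch of a kernel is
    // what triggers its load, so do it once before the measured region.
    spin<<<1, 1>>>(0);
    cudaDeviceSynchronize();

    // Measured region: one launch per stream. With the kernel already loaded,
    // these launches are free to overlap if the device has room for them.
    for (int i = 0; i < kStreams; ++i) {
        spin<<<1, 1, 0, streams[i]>>>(100000000LL);
    }
    cudaDeviceSynchronize();

    cudaError_t err = cudaGetLastError();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamDestroy(streams[i]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

If the kernels overlap in the trace after the warm-up launch but not without it, that points at lazy loading as well.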