Execution time dominated by 'Delayed Execution'

Hi. I have a Windows 11 machine with Visual Studio 2022, Nsight Systems 2023.1.2, a GeForce RTX 3080 Ti, driver 531.14, and CUDA 12.1.

I have a program in which I suspect a performance issue. It launches many kernels, one right after another. All data is on the GPU; there is no copying back and forth between the device and host during the main loop. I have profiled the program with Nsight Systems.

It seems like the ‘blocked state’ is almost entirely filled with ‘Delayed Execution’, and I can’t figure out what is going on. I’ve obscured some names, but the kernels that get delayed are nothing special compared to the other kernels, which execute very fast. Sometimes even very simple data-moving kernels get delayed.

Can someone help me understand what is going on here? Thanks so much!

for (int i = 0; i < MANY; i++) {
    FuncA(data);
    FuncB(data);
}

FuncA(data) {
    kernel1<<<123,138>>>(data);
    kernel2<<<123,138>>>(data);

    kernel10<<<123,138>>>(data);
    kernel11<<<123,138>>>(data);
}

FuncB is similar to FuncA.
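For anyone who wants to cross-check the timeline against a direct measurement, here is a minimal sketch that times a run of back-to-back launches with CUDA events. It is not the real program: dummyKernel and timeLaunches are hypothetical names, the kernel body is a placeholder for kernel1 through kernel11, and only the 123 x 138 launch configuration is taken from the pseudocode above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernels in the post.
__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Times `iters` rounds of back-to-back launches on the default stream.
// Returns the elapsed GPU time in milliseconds, or -1.0f on any CUDA error.
float timeLaunches(float* d_data, int n, int iters) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        // Same launch configuration as in the pseudocode above.
        dummyKernel<<<123, 138>>>(d_data, n);
        dummyKernel<<<123, 138>>>(d_data, n);
    }
    cudaEventRecord(stop);

    // Wait until every launch queued before `stop` has finished.
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaGetLastError();

    float ms = -1.0f;
    if (err == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

To use it, allocate a device buffer of at least n floats with cudaMalloc and call timeLaunches(d_data, n, MANY). If the elapsed time is much larger than the sum of the individual kernel durations shown in the profiler, the gaps between launches (rather than the kernels themselves) are probably where the time goes.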

@jasoncohen

Thanks. I should also note that this program launches kernels from both the host and the device. I’ve since read that Nsight Systems may not support dynamic parallelism, so perhaps that is the issue?

I’ve also tried the program on Ubuntu. It has the same behavior and runtime there, but Nsight Systems does NOT show the red blocked state. The pattern of execution looks similar, though.