Execution time dominated by 'Delayed Execution'

Hi. I have a Windows 11 machine with Visual Studio 2022, Nsight Systems 2023.1.2, a GeForce RTX 3080 Ti, driver 531.14, and CUDA 12.1.

I have a program that I suspect has a performance issue. It makes many kernel calls, one right after another. ALL data is on the GPU, and there is no copying back and forth between the device and host during the main loop. I have profiled the program with Nsight Systems.

It seems like ‘blocked state’ is almost completely filled with ‘Delayed Execution’. I just can’t figure out what is going on here. I’ve obscured some names, but the kernels that get delayed are nothing special compared to the other kernels, which execute very fast. Sometimes even very simple data-moving kernels get delayed.

Can someone help me understand what is going on here? Thanks so much!

for (int i = 0; i < MANY; i++) {
    FuncA(data);
    FuncB(data);
}

FuncA(data) {
    kernel1<<<123,138>>>(data);
    kernel2<<<123,138>>>(data);
    // ...
    kernel10<<<123,138>>>(data);
    kernel11<<<123,138>>>(data);
}

FuncB is similar to FuncA.
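
For anyone who wants to try this pattern, here is a minimal self-contained sketch of the launch loop above. It is not the real program: the kernel `touch`, the helper `time_launch_loop`, the buffer size, the iteration count of 1000, and the figure of 22 kernels per iteration (11 in FuncA plus an assumed 11 in FuncB) are all placeholders I made up for illustration. It times the whole loop with CUDA events on the default stream, which gives a number to compare against the Nsight Systems timeline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): the real kernels are not shown in the post.
__global__ void touch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Issue back-to-back launches with the same <<<123,138>>> configuration as the
// loop above and return the elapsed GPU time in milliseconds, measured with
// CUDA events recorded on the default stream.
float time_launch_loop(float* d_data, int n, int iterations, int kernels_per_iter) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i) {
        for (int k = 0; k < kernels_per_iter; ++k) {
            touch<<<123, 138>>>(d_data, n);
        }
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until every queued kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 123 * 138;  // one element per launched thread (assumption)
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // 1000 iterations and 22 kernels per iteration are assumed values.
    float ms = time_launch_loop(d_data, n, 1000, 22);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    printf("total: %.3f ms, average per launch: %.3f us\n",
           ms, 1000.0f * ms / (1000 * 22));
    cudaFree(d_data);
    return 0;
}
```

Built with something like `nvcc -O2 repro.cu -o repro` (the file name is arbitrary), this can be profiled the same way as the full program to see whether the same state shows up with trivial kernels.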