Gap between kernel executions

I am currently profiling an application that I run on a TK1 (L4T R21.5, Ubuntu). The application launches multiple kernels, and the CPU synchronizes at the end of all of them.
I have the TK1 set to max CPU performance (all CPU cores on, forced to max frequency). I have also forced the GPU to its max frequency; the problem remains.
I can see a 2-3 ms gap between some of the executed kernels, see the attached picture. The CPU is well ahead and waiting to synchronize.

Do you have any idea what the GPU is doing during that time? Any suggestions on how to debug this? I couldn’t find any reason for it in the kernels themselves.

Marc

Have you tried setting different scheduling behaviours via cudaSetDeviceFlags(), particularly cudaDeviceScheduleSpin, cudaDeviceScheduleYield and cudaDeviceScheduleBlockingSync?
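Something along these lines (just a minimal sketch; myKernel is a placeholder, and the flag has to be set before the CUDA context is created, i.e. before the first runtime call that touches the device):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel() { /* placeholder work */ }

int main()
{
    // Must be called before the context exists; pick ONE of:
    //   cudaDeviceScheduleSpin         - lowest latency, busy-waits on the CPU
    //   cudaDeviceScheduleYield        - yields the CPU thread while waiting
    //   cudaDeviceScheduleBlockingSync - blocks the calling thread until the GPU is done
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();   // this wait follows the flag chosen above
    return 0;
}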

Since this happens on a Tegra platform, you may also want to ask in the dedicated Tegra forum “next door”, in case this is an issue related to the physically unified memory on that platform.

tera, thank you for your tip.

Setting cudaDeviceScheduleSpin removes the gap and increases the processing speed. Unfortunately, it also increases the power consumption by 1.5 W (about 20%).

The gap remains when setting cudaDeviceScheduleYield or cudaDeviceScheduleBlockingSync. I thought that once the CUDA kernel execution is scheduled, the CPU isn’t used anymore until all scheduled kernel executions have finished. But it looks like the CPU is still being used while the kernels execute?

I thought that once the CUDA kernel execution is scheduled, the CPU isn’t
used anymore until all scheduled kernel executions have finished.

It’s used to check whether all scheduled kernel executions have finished. ;) That’s what a spinloop does.

It minimizes latencies (as you’ve observed), at the cost of increased CPU load.

Christian

How much speedup do you get from the use of cudaDeviceScheduleSpin? For this use case, do you have to watch out for power draw or energy consumption (e.g. battery-operated equipment)? In many race-to-sleep scenarios, the solution with the higher power draw is actually the more energy-efficient one.

You could try to hack up your own custom polling loop: use cudaStreamQuery() on the stream running the kernel to see when it’s done, and insert your own (short) sleep (likely with usleep()) before polling again.
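A minimal sketch of what I mean (the helper name and the 200 us polling interval are made-up starting points you would tune on the TK1):

#include <cuda_runtime.h>
#include <unistd.h>   // usleep()

__global__ void myKernel() { /* placeholder work */ }

// Poll the stream at a coarse interval instead of letting the runtime
// busy-wait. pollIntervalUs is a tuning knob (assumption: a few hundred
// microseconds is a reasonable starting point).
static cudaError_t lowPowerStreamWait(cudaStream_t stream, unsigned pollIntervalUs)
{
    cudaError_t status;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        usleep(pollIntervalUs);   // give the CPU core a chance to idle
    }
    return status;                // cudaSuccess, or a real error from the stream
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    myKernel<<<1, 32, 0, stream>>>();
    lowPowerStreamWait(stream, 200 /* us */);

    cudaStreamDestroy(stream);
    return 0;
}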

Alternatively, if you have an estimate of the duration of the kernel before you execute it, you could submit the kernel, sleep for that expected duration (or slightly less), THEN synchronize with the power-hungry spinloop.

The two ideas could be used at once.
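Roughly like this, combining the pre-sleep with a short tail wait (expectedRuntimeUs is whatever estimate you have for your kernel; the 10% margin is arbitrary):

#include <cuda_runtime.h>
#include <unistd.h>   // usleep()

__global__ void myKernel() { /* placeholder work */ }

// Combined approach: sleep through the bulk of the (estimated) kernel
// runtime, then fall back to a tight poll for the remainder.
static void sleepThenSpinWait(cudaStream_t stream, unsigned expectedRuntimeUs)
{
    const unsigned margin = expectedRuntimeUs / 10;     // wake up slightly early
    if (expectedRuntimeUs > margin)
        usleep(expectedRuntimeUs - margin);             // cheap coarse wait

    // short, latency-friendly poll for the tail end
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        /* spin */
    }
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    myKernel<<<1, 32, 0, stream>>>();
    sleepThenSpinWait(stream, 2000 /* us, estimated kernel duration */);

    cudaStreamDestroy(stream);
    return 0;
}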

These are both a bit hacky, but could be effective alternatives.

Thanks for the input. For now I will live with the gap, knowing where it comes from. If I need to speed things up further, I know my options.