Some kernel launches are taking much longer (100x) than others in the same CUDA stream

Hi,

I am using CUDA for Finite Element Analysis. My iterative algorithm alternately launches two kernel functions across nine iterations, asynchronously on a single CUDA stream, varying the number of thread blocks with each call but keeping the same functions.
However, when I profiled my program with Nsight Systems, I observed that some of those launches take much longer than others: some take more than 100 us, while others usually take around 5 us.
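
To make the launch pattern concrete, here is a minimal sketch of what my loop does. The kernel names, arguments, and block-count choices are placeholders, not my actual finite element code.

```cpp
// Minimal sketch of the launch pattern described above.
// kernelA / kernelB and the block-count choices are placeholders.
__global__ void kernelA(float* data, int n) { /* element-wise update ... */ }
__global__ void kernelB(float* data, int n) { /* assembly / reduction ... */ }

void runIterations(float* d_data, int n, cudaStream_t stream)
{
    const int threadsPerBlock = 256;
    for (int iter = 0; iter < 9; ++iter) {
        // Block counts vary from call to call; the two kernel functions stay the same.
        int blocksA = (n + threadsPerBlock - 1) / threadsPerBlock;
        int blocksB = blocksA / 2 + 1;
        kernelA<<<blocksA, threadsPerBlock, 0, stream>>>(d_data, n);
        kernelB<<<blocksB, threadsPerBlock, 0, stream>>>(d_data, n);
    }
    // No device/stream synchronization until all launches have been issued.
    cudaStreamSynchronize(stream);
}
```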

Moreover, which kernel launches take unusually long appears to be random. Every time I run the test, different kernel calls are the slow ones, but there are always some calls that take far longer than the rest.

These are the results of my Nsight Systems profiling; the one on top is the overall timeline and the one on the bottom is a zoomed-in view of one kernel launch with a large overhead.



A varying number of thread blocks can produce a varying execution duration for a kernel. Launching more thread blocks for a given kernel will generally correlate with a longer execution time.

With interactive use of the profiler, you can hover your mouse over a given kernel launch to determine the number of blocks in that launch. I can’t do that with posted pictures, however.

All those calls are asynchronous, so the actual execution time should not affect the kernel launch time.
I am not calling any device synchronization until all the kernels have been launched.

If you have a large number of kernel launches outstanding, then you can hit a queue depth limit. In that case, the async kernel launch process becomes synchronous, i.e., the launch call itself blocks until a queue slot opens. I don’t know whether that is happening here; the necessary information cannot be determined from the pictures or your description.

If there are in fact two kernels launched over nine iterations, then there are only 18 kernel launches total, and the queue depth is not likely to be an issue. But your pictures seem to depict more than 18 total kernel launches. Again, it is hard to be certain from pictures.
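
One way to check whether launches are hitting that limit is to time the launch call itself on the host side: as long as the queue has room, the launch returns in a few microseconds; once the queue is full, the call blocks until a slot opens. Here is a minimal, self-contained sketch; the dummy kernel, launch count, and 50 us threshold are just placeholders for illustration.

```cpp
#include <chrono>
#include <cstdio>

// Placeholder kernel; in a real test you would time your own launches.
__global__ void dummyKernel() { }

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < 5000; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        dummyKernel<<<1, 32, 0, stream>>>();   // async launch into the stream
        auto t1 = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > 50.0)   // host-side launch call took unusually long
            printf("launch %d blocked for %.1f us\n", i, us);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```

If the host-side time of some launches jumps from a few microseconds to tens or hundreds of microseconds while the rest stay cheap, that is consistent with the launch queue filling up.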

Thank you for your answer! Yes, my queue could well be full, because I do not loop through those 18 calls just once; I usually loop through them a few hundred times. I am pretty sure I would hit the queue depth limitation.

But my question is, will this slow me down? If so, what should I do to avoid this problem?

If the “long pole” (i.e., the determinant of application performance, or application run time) is the kernel activity (which seems likely to me), then this will not “slow you down”. I would view it as generally a good thing if an application can keep a continuous stream of work ready for the GPU. The alternative would be gaps during which the GPU is idle. Most folks I know of don’t consider that to be a good situation.

I’m not aware of anything that could be directly done to avoid this “problem”.

You could get a faster GPU. You could refactor your code to reduce the number of kernel launches. I don’t have a general recipe for that, or anything like that.
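
Purely as an illustration of that second point, not a general recipe: if the two kernels in an iteration operate on the same elements and the second does not need results produced by other blocks of the first, the pair of launches can sometimes be fused into a single kernel, halving the launch count. A sketch with placeholder work:

```cpp
// Illustration only: fusing two per-element kernels into one launch.
// The arithmetic stands in for whatever kernelA and kernelB actually compute.
__global__ void fusedKernel(float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // work formerly done by the first kernel on this element
        data[idx] = data[idx] * 0.5f;
        // work formerly done by the second kernel on the same element
        data[idx] = data[idx] + 1.0f;
    }
}
```

Whether such a fusion is legal depends entirely on the data dependencies between the two original kernels.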


Thank you for your answer! Now I finally understand what’s happening there. As to “getting a faster GPU”, I don’t know if there is an easy way of doing that, because I am already using a 4090.

Aggressive cooling may allow this high-end GPU to operate at its highest possible boost clocks more frequently. From my limited experiments with GPU-accelerated software, you are likely looking at a low single-digit percentage application-level speedup this way. Most PC hardware (not considering laptops) can operate at ambient temperatures down to 5 deg C / 40 deg F without issues.

If you live in the northern hemisphere and at high enough latitudes, now is the right time of year to try this if you want to stay with air cooling :-) Don’t forget to clean the heatsink / fan assembly of dust. For a more elaborate approach, you could try water cooling.

There does not seem to be a realistic alternative way of boosting hardware performance at this time, so software optimization is the way to go.