Some kernel launches are taking much longer (100x) than others in the same CUDA stream

Hi,

I am using CUDA for Finite Element Analysis. My iterative algorithm alternately launches two kernel functions across nine iterations, asynchronously on a single CUDA stream, varying the number of thread blocks with each call but keeping the same functions.
However, when I profiled my program with Nsight Systems, I observed that some of those launches take much longer than others: some take more than 100 us, while others usually take around 5 us.
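
To make the launch pattern concrete, here is a minimal sketch of what my loop does. The kernel names, arguments, and block-count choices are placeholders, not my actual finite element code.

```cpp
// Minimal sketch of the launch pattern described above.
// kernelA / kernelB and the block-count choices are placeholders.
__global__ void kernelA(float* data, int n) { /* element-wise update ... */ }
__global__ void kernelB(float* data, int n) { /* assembly / reduction ... */ }

void runIterations(float* d_data, int n, cudaStream_t stream)
{
    const int threadsPerBlock = 256;
    for (int iter = 0; iter < 9; ++iter) {
        // Block counts vary from call to call; the two kernel functions stay the same.
        int blocksA = (n + threadsPerBlock - 1) / threadsPerBlock;
        int blocksB = blocksA / 2 + 1;
        kernelA<<<blocksA, threadsPerBlock, 0, stream>>>(d_data, n);
        kernelB<<<blocksB, threadsPerBlock, 0, stream>>>(d_data, n);
    }
    // No device/stream synchronization until all launches have been issued.
    cudaStreamSynchronize(stream);
}
```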

Moreover, which kernel launches take unusually long appears to be random. Every time I run the test, different kernel calls are the slow ones, but there are always some calls that take far longer than the rest.

These are the results of my Nsight Systems profiling; the one on top is the overall timeline and the one on the bottom is a zoomed-in view of one kernel launch with a large overhead.



A varying number of thread blocks can produce a varying execution duration for a kernel. Launching more thread blocks for a given kernel will generally correlate with a longer execution time.

With interactive use of the profiler, you can hover your mouse over a given kernel launch to determine the number of blocks in that launch. I can’t do that with posted pictures, however.

All those calls are asynchronous, so the actual execution time should not affect the kernel launch time.
I am not calling any device synchronization until all the kernels have been launched.

If you have a large number of kernel launches outstanding, then you can hit a queue depth limit. In that case, the async kernel launch process becomes synchronous, i.e., the launch call itself blocks until a queue slot opens. I don’t know whether that is happening here; the necessary information cannot be determined from the pictures or your description.

If there are in fact two kernels launched over nine iterations, then there are only 18 kernel launches total, and the queue depth is not likely to be an issue. But your pictures seem to depict more than 18 total kernel launches. Again, it is hard to be certain from pictures.
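
One way to check whether launches are hitting that limit is to time the launch call itself on the host side: as long as the queue has room, the launch returns in a few microseconds; once the queue is full, the call blocks until a slot opens. Here is a minimal, self-contained sketch; the dummy kernel, launch count, and 50 us threshold are just placeholders for illustration.

```cpp
#include <chrono>
#include <cstdio>

// Placeholder kernel; in a real test you would time your own launches.
__global__ void dummyKernel() { }

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < 5000; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        dummyKernel<<<1, 32, 0, stream>>>();   // async launch into the stream
        auto t1 = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > 50.0)   // host-side launch call took unusually long
            printf("launch %d blocked for %.1f us\n", i, us);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```

If the host-side time of some launches jumps from a few microseconds to tens or hundreds of microseconds while the rest stay cheap, that is consistent with the launch queue filling up.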

Thank you for your answer! Yes, my queue could well be full, because I do not loop through those 18 calls just once; I usually loop through them a few hundred times. I am pretty sure I would hit the queue depth limitation.

But my question is, will this slow me down? If so, what should I do to avoid this problem?

If the “long pole” (i.e., the determinant of application performance, or application run time) is the kernel activity (which seems likely to me), then this will not “slow you down”. I would view it as generally a good thing if an application can keep a continuous stream of work ready for the GPU. The alternative would be gaps during which the GPU is idle. Most folks I know of don’t consider that to be a good situation.

I’m not aware of anything that could be directly done to avoid this “problem”.

You could get a faster GPU. You could refactor your code to reduce the number of kernel launches. I don’t have a general recipe for that, or anything like that.
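
Purely as an illustration of that second point, not a general recipe: if the two kernels in an iteration operate on the same elements and the second does not need results produced by other blocks of the first, the pair of launches can sometimes be fused into a single kernel, halving the launch count. A sketch with placeholder work:

```cpp
// Illustration only: fusing two per-element kernels into one launch.
// The arithmetic stands in for whatever kernelA and kernelB actually compute.
__global__ void fusedKernel(float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // work formerly done by the first kernel on this element
        data[idx] = data[idx] * 0.5f;
        // work formerly done by the second kernel on the same element
        data[idx] = data[idx] + 1.0f;
    }
}
```

Whether such a fusion is legal depends entirely on the data dependencies between the two original kernels.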


Thank you for your answer! Now I finally understand what’s happening there. As to “getting a faster GPU”, I don’t know if there is an easy way of doing that, because I am already using a 4090.

Aggressive cooling may allow this high-end GPU to operate at its highest possible boost clocks more frequently. From my limited experiments with GPU-accelerated software, you are likely looking at a low single-digit percentage application-level speedup this way. Most PC hardware (not considering laptops) can operate at ambient temperatures down to 5 deg C / 40 deg F without issues.

If you live in the northern hemisphere and at high enough latitudes, now is the right time of year to try this if you want to stay with air cooling :-) Don’t forget to clean the heatsink / fan assembly of dust. For a more elaborate approach, you could try water cooling.

There does not seem to be a realistic alternative way of boosting hardware performance at this time, so software optimization is the way to go.