Launch of repeated CUDA kernels in a ‘for-loop’

Hello everyone.

While developing and running an algorithm, I have found that the host-side latency required to launch CUDA kernels in a ‘for-loop’ cannot be ignored. The latency of launching CUDA kernels and APIs matters for this algorithm because the simulation targets real-time operation. If launching the kernels takes more time than the sample period, the algorithm might not be able to run in real time.

Is there any way to launch the CUDA kernels or CUDA APIs used in the ‘for-loop’, including cusolver, cublas, and cusparse, only once before entering the loop? If any point in my question is unclear, please let me know and I will clarify it.

Thank you for your help and time.

Best regards.

You can launch CUDA kernels in a for-loop, or before a for-loop, or both.
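For illustration, here is a minimal sketch of that pattern; the kernel `step`, its body, and all sizes are placeholders, not taken from your application:

```
// Minimal sketch: the kernel 'step' and all sizes are placeholders.
__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1.01f * data[i];      // stand-in for real per-sample work
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launches are asynchronous: each iteration pays only the host-side
    // launch cost, and the loop keeps going while the GPU works.
    for (int iter = 0; iter < 1000; iter++)
        step<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaDeviceSynchronize();                   // wait once, after the loop
    cudaFree(d_data);
    return 0;
}
```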

The lower bound for launch overhead of CUDA kernels on reasonably fast systems without broken driver models (WDDM) is 5 microseconds. That number has been constant for the past ten years, so I wouldn’t expect it to change anytime soon. This consists of a hardware component related to PCIe transport, plus a software component (code running on the host).

To minimize the latter:

(1) Use Linux, or Windows with a TCC driver (only supported with some GPUs!)
(2) Use a CPU with high single-thread performance (my usual recommendation is >= 3.5 GHz base frequency for CPUs with up to eight cores).

GPUs are designed as throughput machines. If you have tight latency requirements (e.g. high-speed trading of securities), GPUs might not be the preferred solution.
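To get a feel for the launch overhead on your own system, one rough way (a sketch, not a rigorous benchmark) is to time a large number of back-to-back launches of an empty kernel:

```
#include <cstdio>
#include <chrono>

__global__ void empty_kernel(void) { }

int main(void)
{
    const int launches = 100000;
    empty_kernel<<<1, 1>>>();            // warm-up launch
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; i++) {
        empty_kernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average time per launch: %.2f microseconds\n", us / launches);
    return 0;
}
```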

Thanks for your kind answers, Robert_Crovella and njuffa.

May I ask further about the following comment from Robert?

As I understand the comment above, would it be possible to launch CUDA kernels or APIs, including cublas and cusolver, that are repeatedly called in the ‘for-loop’ outside the ‘for-loop’ instead, and keep them in memory so that the kernels perform their calculations without being launched by the host on each iteration?

Thank you for your help.

Best regards

CUDA kernels can be launched from the device (see “dynamic parallelism”). However, such launches are not noticeably faster than ones launched from the host, from any data I have ever seen.

A hypothetical explanation for this is that device-side launches inject the necessary data into the same internal GPU processing paths that are used by host launches.
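For completeness, here is what a device-side launch looks like in outline; the kernel names and sizes are placeholders. It requires a GPU of compute capability 3.5 or newer, compilation with relocatable device code (-rdc=true), and linking against cudadevrt.

```
__global__ void child(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void parent(float *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // This launch goes through the same internal launch machinery as
        // a host-side launch, so it is not noticeably cheaper; the parent
        // grid does not complete until the child grid has completed.
        child<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```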

So no matter how you slice and dice it, 200K kernel launches per second was speed of light ten years ago, is speed of light today, and will likely remain speed of light for the foreseeable future.

Since GPUs achieve higher computational and memory throughput all the time (albeit much more slowly from now on, with the demise of Moore’s Law), CUDA programmers should pack a sufficiently large amount of work into each kernel launch (for example, by batching) to avoid becoming limited by kernel launch overhead going forward.
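As a sketch of what batching can mean in practice (the kernel, its per-element work, and the sizes are placeholders), compare launching one kernel per small signal with a single launch that covers a whole batch via a grid-stride loop:

```
// Sketch: amortizing launch overhead by batching.
__global__ void process(const float *in, float *out, int len)
{
    // grid-stride loop: one launch covers however many elements 'len'
    // contains, so the launch cost is paid once per launch, not per element
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < len;
         i += gridDim.x * blockDim.x)
        out[i] = 2.0f * in[i];
}

int main(void)
{
    const int n = 4096;          // length of one signal
    const int batch = 256;       // signals processed per batched launch
    float *d_in, *d_out;
    cudaMalloc(&d_in,  (size_t)n * batch * sizeof(float));
    cudaMalloc(&d_out, (size_t)n * batch * sizeof(float));

    // one launch per signal: pays the launch overhead 'batch' times ...
    for (int s = 0; s < batch; s++)
        process<<<(n + 255) / 256, 256>>>(d_in + s * n, d_out + s * n, n);

    // ... versus one launch for the whole batch: overhead paid once
    process<<<256, 256>>>(d_in, d_out, n * batch);

    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```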

GPUs are ideal for tasks requiring maximum throughput (this is easily the majority of the application universe today, I would claim). Tasks that require minimum latency with modest working set sizes are best served by high-frequency CPUs with fast and large caches. An example would be a quad-core Xeon W-2125 with 8.25 MB of cache and 85 GB/sec of system memory bandwidth, operating at a base frequency of 4.0 GHz and boosting up to 4.5 GHz. Some people with extreme demands on low-latency processing use aggressive cooling and overclocking to speed up such platforms further.

For a number of applications, combining a low-latency CPU platform for the serial portions of an application with the massive throughput of modern GPUs for the parallelizable portions of the application provides the optimal platform.

Dear njuffa

Thanks again for your kind answer, and please bear with my continuing questions.

I am considering a system that combines CPU and GPU computation in my algorithm, as you mentioned at the end of your answer. To be more specific, what I was trying to figure out is the host-side latency required to launch CUDA kernels. Since the algorithm mainly consists of a ‘for-loop’ of arithmetic performed with CUDA kernels, I was wondering whether there is any way to resolve this issue of long launch latency in each iteration.

I presume you suggested ‘dynamic parallelism’ as a way of avoiding host-side latency by having kernels launch subsequent kernels, which I think is an excellent idea. But I found that the CUDA libraries I mainly use in the algorithm, including MAGMA, cublas, and cusolver, cannot be called from the device side.

According to Robert’s response, CUDA kernels can be launched outside the ‘for-loop’. I was wondering again whether those CUDA kernels or APIs can be kept in memory during the ‘for-loop’ after being launched outside it, and, if so, how to achieve that.

To convey my situation better, I have attached a Visual Profiler result for my algorithm, with the host-side latency for launching kernels and APIs in every iteration marked.

Thank you very much for your help.

Best regards

What I am suggesting is to give each CUDA kernel enough work to do, in which case the launch latency won’t be a performance limiter. Generally speaking you may want to target a kernel run time of 10 milliseconds to 100 milliseconds on the fastest GPUs currently in use (which then means the kernels will run about ten times as long on the slowest GPUs currently in use).

If your application has real-time human interaction requirements, you would want to target the lower end of that range: generally humans experience delays of up to 100 milliseconds as instantaneous.
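If you want to check where your kernels fall within such a range, one common approach (sketched here with a placeholder kernel and placeholder sizes) is to time them with CUDA events:

```
#include <cstdio>

__global__ void my_kernel(float *data, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = sinf(data[i]);
}

int main(void)
{
    const int n = 1 << 24;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // GPU time in milliseconds
    printf("kernel run time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```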

It is not clear (to me) what this is asking about. The attached profiler diagrams are not helping, either. Clearly, data initially copied to a GPU can stay there as long as your app (the process that owns the data) is running, with kernel after kernel operating on that resident data.
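As a sketch of that pattern, using cuBLAS since you mentioned it (the matrix dimensions and the GEMM call are just placeholders for your actual library calls): the handle and the device buffers are created once before the loop, each iteration then only launches work on data that is already resident on the GPU, and results are copied back once at the end.

```
#include <cublas_v2.h>

void run(const float *h_A, const float *h_B, float *h_C,
         int n, int iterations)
{
    // Set up once, before the loop.
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *d_A, *d_B, *d_C;
    size_t bytes = (size_t)n * n * sizeof(float);
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    for (int k = 0; k < iterations; k++) {
        // operates on data that stays resident on the GPU; no per-iteration copies
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, d_A, n, d_B, n, &beta, d_C, n);
    }

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);   // copy results once

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cublasDestroy(handle);
}
```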

As an advanced technique, you could have a kernel resident on the GPU for a long time (provided that GPU is not used for display purposes), interacting with host code in various ways.
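A bare-bones sketch of that idea is shown below, using a mailbox in mapped pinned host memory; everything in it (names, sizes, the spin-wait protocol) is illustrative, and a production version would need considerably more care with synchronization, error checking, and watchdog limits.

```
// Persistent-kernel sketch: launched once, the kernel stays resident and is
// handed work through a mailbox in mapped pinned memory. Not suitable for a
// GPU that also drives a display (watchdog timeouts).
#include <cstdio>

__global__ void persistent_kernel(volatile int *cmd, float *buf, int n)
{
    __shared__ int c;
    for (;;) {
        if (threadIdx.x == 0) {
            while ((c = *cmd) == 0) { /* spin, waiting for a request */ }
        }
        __syncthreads();
        if (c < 0) break;                     // -1 = terminate

        for (int i = threadIdx.x; i < n; i += blockDim.x)
            buf[i] += 1.0f;                   // stand-in for the real work

        __syncthreads();
        __threadfence_system();               // make results visible to the host
        if (threadIdx.x == 0) *cmd = 0;       // signal completion
        __syncthreads();
    }
}

int main(void)
{
    const int n = 1024, iterations = 100;
    int *cmd; float *buf;
    // Mapped pinned memory: on a 64-bit system with UVA the host pointers
    // can be passed to the kernel directly.
    cudaHostAlloc((void **)&cmd, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&buf, n * sizeof(float), cudaHostAllocMapped);
    *cmd = 0;

    persistent_kernel<<<1, 256>>>(cmd, buf, n);          // launched exactly once

    for (int iter = 0; iter < iterations; iter++) {
        for (int i = 0; i < n; i++) buf[i] = (float)i;   // produce this iteration's input
        *(volatile int *)cmd = 1;                        // request work
        while (*(volatile int *)cmd != 0) { /* spin */ } // wait for completion
        // ... consume results in buf here ...
    }
    *(volatile int *)cmd = -1;                           // ask the kernel to exit
    cudaDeviceSynchronize();

    cudaFreeHost(cmd);
    cudaFreeHost(buf);
    return 0;
}
```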

The algorithm I am trying to develop matches the phase between two signals, and it must not incur any host-side latency longer than a pre-defined sampling period.

Does ‘kernel resident on GPU’ mean that CUDA kernels are not necessarily freed or closed after their computation is finished? If that is possible, wouldn’t it also be possible to launch all the CUDA kernels before entering the ‘for-loop’ and keep them in GPU memory so that the host does not need to launch them again?

That’s about as specific as “my work deals with objects in circular motion”: Are we talking Large Hadron Collider or Indy 500 :-) I assume matching phase between the two signals involves measuring the phase shift between the signals, and then computing an adjustment to be applied to some hardware? Is the phase-shift computation based on FFT? Is there relevant literature you can point to that outlines the context of this work?

How long is this “pre-defined sampling period”? By how much is your current code missing the required time limit? Is there some sort of feedback loop involved (rather than a simple pipeline) that requires results to be presented for the current time slice prior to the start of the next time slice? If there is a feedback loop, could it be extended to cover multiple time slices? What GPU are you running on?

Signal processing is not an area where I have in-depth expertise, but I am aware of various kinds of soft real-time signal processing tasks being performed with CUDA, in a variety of application areas, from simple audio processing to fairly complex synthetic aperture radar or MRI. I don’t recall ever encountering launch latency as a limiting factor in any of these (which doesn’t mean that cannot happen, of course).

Some random thoughts since I don’t know what it is you are trying to accomplish:

(1) You may have to re-think your hardware setup
(2) You may have to re-think your software design
(3) The task at hand may be a poor match for GPU acceleration