Overlapping kernel computation with a stream per (CPU) thread, slow kernel launches

Hey all,

I have the following problem:

I have relatively small kernels that finish within the order of microseconds (the smallest in 2 us) and don't occupy a lot of threads. To keep all threads of the GPU (TITAN X Pascal) busy, I decided to make use of streams. Many of the kernels are independent of each other and can run in parallel in different streams. However, launching a kernel takes 5 to 10 microseconds, which is longer than most of the kernels take to complete. As a result, the CPU cannot launch kernels fast enough to keep up with the GPU and allow concurrent kernel execution in multiple streams. Only when some longer kernels are launched (the ones in blue and turquoise) that keep the GPU busy for a while (the longest 67 us) is the CPU able to launch multiple kernels and "build up" a backlog of kernels ready to execute. This can be seen in the image below, where there is some stream concurrency after the longer kernels finish running. So the problem is that the kernel launches are too slow for a backlog of ready-to-execute kernels to build up and be executed once part of the GPU goes idle.
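Roughly, my setup looks like the sketch below; the kernel body, problem sizes, and stream count are placeholders, not my actual code. One CPU thread issues many small, independent kernels round-robin into a pool of streams:

```cpp
// Sketch of the single-threaded setup: many small, independent kernels issued
// round-robin into a pool of streams, hoping they overlap on the GPU.
// The kernel body, sizes, and stream count are placeholders.
#include <cuda_runtime.h>

__global__ void smallKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;    // trivial placeholder work
}

int main()
{
    const int numStreams = 8;
    const int numKernels = 256;
    const int n          = 4096;                    // small problem size per kernel

    float *d_data;
    cudaMalloc(&d_data, (size_t)numKernels * n * sizeof(float));

    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    // One CPU thread issues every launch; each launch costs ~5-10 us of host
    // time, longer than many of these kernels need to run to completion.
    for (int k = 0; k < numKernels; ++k)
        smallKernel<<<(n + 255) / 256, 256, 0, streams[k % numStreams]>>>(
            d_data + k * n, n);

    cudaDeviceSynchronize();
    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return 0;
}
```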

I thought I'd solve this problem by using multiple CPU threads to launch the kernels. Each CPU thread gets its own stream by using the -default-stream per-thread NVCC flag. However, it turns out that having multiple threads launch kernels on the GPU increases the kernel launch time. For example, four CPU threads result in kernel launch times of 40 to 50 us. This pretty much leaves me with the same problem: kernels are not launched fast enough to enable GPU stream concurrency. As you can see, there is not a lot of concurrency going on for the smaller kernels:
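For reference, the launch scheme I'm describing looks roughly like the sketch below (again with a placeholder kernel and sizes, not my actual code), compiled with nvcc -default-stream per-thread so that each host thread's launches go into that thread's own default stream:

```cpp
// Sketch of the multi-threaded variant (placeholder kernel and sizes).
// Compile with:  nvcc -default-stream per-thread multi_launch.cu
// so that each CPU thread's launches go into that thread's own default stream.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void smallKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;    // trivial placeholder work
}

void launchWorker(float *d_chunk, int kernelsPerThread, int n)
{
    // With -default-stream per-thread these launches land in this thread's
    // own default stream, so no explicit cudaStream_t is needed.
    for (int k = 0; k < kernelsPerThread; ++k)
        smallKernel<<<(n + 255) / 256, 256>>>(d_chunk + k * n, n);
    cudaStreamSynchronize(cudaStreamPerThread);    // wait for this thread's stream
}

int main()
{
    const int numThreads       = 4;
    const int kernelsPerThread = 64;
    const int n                = 4096;

    float *d_data;
    cudaMalloc(&d_data, (size_t)numThreads * kernelsPerThread * n * sizeof(float));

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(launchWorker,
                             d_data + (size_t)t * kernelsPerThread * n,
                             kernelsPerThread, n);
    for (auto &w : workers) w.join();

    cudaFree(d_data);
    return 0;
}
```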

The NVIDIA profiler user guide says the following about the Runtime API row in the Visual Profiler:

Runtime API
A timeline will contain a Runtime API row for each CPU thread that performs a CUDA Runtime API call. Each interval in the row represents the duration of the call on the corresponding thread.

My question is: why does the kernel launch time (and possibly the duration of other calls) increase with the use of more CPU threads?

P.S. I’m using CUDA 8

That is entirely as expected. Launch latency has been of that order of magnitude for the past 10 years.

If your kernels have run times of similar magnitude, you may want to seriously think about how to give each kernel more work to do, because the next generation of GPUs (Volta), due out next year, is going to be even faster than the current Pascal generation, exacerbating any problems you are seeing now.

While it is of no help in reducing launch latency (a question of the physical host / device interface), general host-side overhead can be minimized by choosing CPUs with high single-thread performance. My current recommendation is to use CPUs with a base frequency > 3.5 GHz.

BTW, what’s your operating system? That can have an influence on the launch queuing behavior. If Windows, are you operating the Titan X Pascal in TCC mode?

Hey njuffa,

Thanks for your reply! I'm using an AMD Ryzen 7 1800X (3.6 GHz), so I should be good there. The operating system I'm running is Ubuntu 16.04.

The launch latency itself was indeed not entirely unexpected. The problem is that it increases to 40-50 us when launching kernels from multiple CPU threads.

Do you have any idea why that would be?

I have no insight into the launch latency increasing with multiple threads, and I cannot see anything useful in the output from the profiler. Someone more skilled in interpreting profiler timeline diagrams may be able to spot something.

I do not know what current best practices are regarding multi-threading (check the Best Practices Guide), but I never use more than one CPU thread per GPU. Perhaps naively, I would expect the use of N threads to result in an N-fold increase in observed launch latency on a per-thread basis when the threads are served in round-robin fashion.

The only thing to interpret from the profiler (I know it's kind of messy) is that the second approach still doesn't result in the concurrent execution of multiple (smaller) kernels. I had hoped that the use of 4 CPU threads would allow the kernel launch frequency to go up by a factor of 4, thereby allowing the number of kernels "ready to execute" to build up, which would result in more stream concurrency during the execution of the smaller kernels.

Unfortunately, I can't really find anything on this in the best practices guide. I can understand your explanation and think it is probably something like this. However, I was trying to understand how it works exactly: what is the limiting factor in these kernel launches? Is some part of the GPU actively "busy" while processing a kernel-launch request, so that other launch requests have to wait until the current one is finished?

As I said, I think the basic limitation is that one can issue at most about 200,000 kernels per second to the GPU, regardless of the number of CPU threads involved. That's the maximum rate for null kernels (which take no arguments and do nothing); it is likely a tad slower for kernels that do something.
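If you want to verify this yourself, something along the lines of the following sketch should do; the launch count and timing approach are just illustrative, not from any NVIDIA document:

```cpp
// Minimal launch-rate measurement sketch (illustrative only). The kernels are
// empty, so the host-side issue rate is the bottleneck being measured.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void nullKernel() { }   // takes no arguments, does nothing

int main()
{
    const int launches = 100000;

    nullKernel<<<1, 1>>>();        // warm-up to pay one-time initialization costs
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; ++i)
        nullKernel<<<1, 1>>>();    // asynchronous launches from a single host thread
    cudaDeviceSynchronize();       // wait until the GPU has drained the queue
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    printf("%.0f launches/s  (%.2f us per launch)\n",
           launches / seconds, 1e6 * seconds / launches);
    return 0;
}
```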

Someone more familiar with multi-threading may have some ideas. Regardless of what the exact mechanism is, your basic problem won’t go away: Kernel run times are too short, and the way to better utilization is to increase kernel run times by giving each kernel more work. You would want to get to kernel execution times in the millisecond rather than microsecond range.

Thanks for your help. I eventually solved the problem by making the kernels larger (each now processes multiple images in a sequential kind of way), but I was still looking for an explanation of why the previous approach did not work (also to put in my report). Do you have a source or reference for the 200,000 kernels per second statement? That might just be all I need!
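For completeness, the fix looks roughly like the sketch below; the image layout and the per-pixel operation are made-up placeholders, my real kernels do something else:

```cpp
// Sketch of the "larger kernel" fix: one launch covers a whole batch of images
// that were previously handled by separate launches. The image layout and the
// per-pixel operation are made-up placeholders.
#include <cuda_runtime.h>

__global__ void processBatch(float *images, int imagesPerBatch, int pixelsPerImage)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= pixelsPerImage) return;

    // Each thread walks sequentially through the images in the batch,
    // handling the same pixel index in every image.
    for (int img = 0; img < imagesPerBatch; ++img) {
        float *pixel = images + (size_t)img * pixelsPerImage + p;
        *pixel = *pixel * 2.0f + 1.0f;              // placeholder per-pixel work
    }
}

int main()
{
    const int imagesPerBatch = 32;                  // work formerly spread over 32 launches
    const int pixelsPerImage = 1 << 16;

    float *d_images;
    cudaMalloc(&d_images, (size_t)imagesPerBatch * pixelsPerImage * sizeof(float));

    // The ~5-10 us launch cost is now paid once per batch instead of per image.
    processBatch<<<(pixelsPerImage + 255) / 256, 256>>>(
        d_images, imagesPerBatch, pixelsPerImage);
    cudaDeviceSynchronize();

    cudaFree(d_images);
    return 0;
}
```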

If anyone else with more multi-threading experience has any ideas, please still elaborate!

That is based on observation since GT200 or thereabouts. No matter what the operating system, what the GPU, or how fast the CPU: one kernel issued every 5 microseconds (so a rate of 200,000 kernel launches per second) is the absolute fastest it will go.

From discussing this with some relevant NVIDIA engineers in the past, I took away the message that there are fundamental hardware limitations involved when the CPU must talk to the GPU across PCIe. While newer PCIe versions have increased throughput, they have not improved basic latencies. I guess it's like DRAM: throughput goes up all the time, but a basic access latency of about 60-70 ns has existed for the past 10 years (or longer; my memory is a bit hazy).

Thanks for your quick answers. I'll try to find a source for that somewhere to put in the report. The thing is that the profiler makes it seem like each CPU thread is busy for a longer time in the multi-threaded approach, while this might have more to do with the GPU not being able to handle the requests fast enough.

I do not know what the detailed handshake between CPU and GPU looks like during a kernel launch.