How to effectively parallelize cuda kernel launches on CPU

guan.huifeng · January 18, 2018, 10:12pm

My work uses 4 streams and I wish they can be run concurrently. The code is something like this:

for (int i=0; i<N; i++)   //batch numbers
{
    for (int j=0; j<4; j++)
       myCudaCode(stream[j]);    // working codes using the specified stream
}

However, from nvvp profiler I see the streams are actually not concurrently running because the CPU are fully occupied by the kernel launches. I did not use any cudaDeviceSynchronize. You may see the figure from the following link.

I understand that all my kernels on GPU are pretty small, comparable to the kernel launch times on CPU. But so far we do not intend to change them. From the above figure, I see most kernel launches on CPU take around 5~10 us, which is considered normal. The whole processing time for one batch is around 0.4 ms (as shown in the gray)

An intuitive thinking to optimize the code is to use multi-threading to parallelize the CUDA kernel launches on CPU. Here is what I did by use of openMP:

for (int i=0; i<N; i++)
{
    #pragma omp parallel num_threads(4)
       myCudaCode(stream[omp_get_thread_num()]);
}

Now the nvvp profiler shows like this:

The four streams are seemingly running concurrently. However, for each CPU thread, the CUDA kernel launches are not no longer as compact as before, and also are significantly stretched (typically 20~30 us). The resulting times required for one batch processing (shown in gray) are now around 0.5ms, even longer than the single thread case.

I also tried pthread method. It shows the similar problem.

So I’d like to ask for an effective way to parallize the kernel launches on CPU. Ideally, the times are expected be reduced by one fourth.

njuffa · January 18, 2018, 10:27pm

GPU kernels can only run concurrently when each individual kernel does not fill the GPU completely. Are your GPU kernels sufficiently small that they do not fill up the GPU individually?

What OS are you on? If it is Windows and you are using the default WDDM driver, kernel launch batching performed by the driver for performance reasons may prevent the kernels from running concurrently even if they are small enough.

If it were possible to concurrently run kernels that are able to fill the GPU by themselves, the overall throughput of the processing pipeline wouldn’t improve.

Your initial approach of using non-null streams to achieve concurrency is appropriate assuming the kernels are small enough to facilitate that. Starting the kernels from different CPU threads is not helpful as all you are doing is adding host-side overhead.

guan.huifeng · January 18, 2018, 10:45pm

I’m pretty sure every kernel is small enough, far away from the full GPU computing resources.

I’m using Linux, i7 8 cores CPU and GTX 1070 GPU.

I don’t understand why host-side overhead of starting the kernels from different CPU threads will be that large? Though I did not give a zoomed-in figure, actually there are 14 kernels launched in each stream. From the nvvp profiler, not only the first kernel launch of each thread but all the kernel launches undertake significant latency. Also, every kernel launch time changes from 5~10 us to 20-~30 us, which does not seem normal to me.

So if we do not intend to change the kernel functions, did you mean there is no way to effectively parallelize the kernel launches in order to optimize the performance? Because in the future we may want to increase to 16 steams or even more. It is not ideal that all the streams have to be serialized.

njuffa · January 18, 2018, 11:09pm

I don’t know what your kernels look like or how they are configured, so I can’t give feedback on whether they should be able to execute concurrently. CUDA ships with a sample application that demonstrates concurrent kernels (named concurrentKernels), which can also be downloaded individually:

[url]CUDA Samples :: CUDA Toolkit Documentation

Have you had a chance to experiment with that app? I have not tried the profiler with this app, so I don’t know what to expect. The profiler may be intrusive enough to make an observation of concurrently executing kernels unlikely. IMHO, the easiest way forward would be to give that app a try and then use it as a template for your own work.

Robert_Crovella · January 18, 2018, 11:52pm

I noticed that there are a sequence of ~10 kernels running in each stream, in the single thread case.

That doesn’t seem to match the dispatch you’ve indicated:

for (int i=0; i<N; i++)   //batch numbers
{
    for (int j=0; j<4; j++)
       myCudaCode(stream[j]);    // working codes using the specified stream
}

I can’t really match the two. Doesn’t seem to be enough information here, or else what you’ve shown code-wise is not really representative of your actual code in the single thread case.

guan.huifeng · January 19, 2018, 12:07am

Sorry if there is any confusion. For

myCudaCode(stream[j]);

it is not just a kernel function. It is a host function that actually contains a lot stuffs. It contains ~10 kernel calls and some host-side codes. But all the kernel calls are associated with the specific stream with the index j.

Hope it can clarify.

txbob:

I noticed that there are a sequence of ~10 kernels running in each stream, in the single thread case.

That doesn’t seem to match the dispatch you’ve indicated:
for (int i=0; i<N; i++)   //batch numbers
{
    for (int j=0; j<4; j++)
       myCudaCode(stream[j]);    // working codes using the specified stream
}
I can’t really match the two. Doesn’t seem to be enough information here, or else what you’ve shown code-wise is not really representative of your actual code in the single thread case.

guan.huifeng · January 19, 2018, 12:14am

I looked at the sample code. It involves unified memory and therefore the nvvp profiler looks wired and is not easy to interpret. Sorry that I do not have too much knowledge on UM. But I noticed that their kernel launch on CPU are also typically as long as around >30 us.

At a first glance, that sample also used a simple openMP structure which I do not see much difference from mine.

guan.huifeng · January 19, 2018, 1:36am

Update: based on my experiments, it seems the use of multi CPU threads do not reduce the total kernel launch times at all. Suppose the single thread code for processing N streams requires time T, then using openMP and the N-thread code for processing N streams (one-stream for one-thread) will also approximately require time T. As the figure I posted, even though the streams are now seemingly concurrent, but every kernel launch latency becomes also significant. Interestingly, the total time (for one batch or N streams) therefore remains approximately unchanged.

I suspect that openMP+CUDA may only work for multi-GPU computing, but for single-GPU computing it is not working as we expect. Hope the NVIDIA people can check your internal log to see if this problem has been previously reported.

BulatZiganshin · January 19, 2018, 1:42am

it’s not a problem. it’s how things should work. you have only one GPU (or N GPUs, doesn’t matter). Why filling it (them) with work from multiple CPU threads, or even multiple CPUs can change the time that this GPU(s) need to execute kernels? amount of the work is the same, it’s independent on how much CPUs are sending this work to GPU(s)

guan.huifeng · January 19, 2018, 2:06am

You may not have looked into the profiler image I posted. When I used one only CPU thread, the four streams do not overlap, which means we did not fully make use of the GPU computing ability. Since in my work the computational task in each stream is pretty small, so it is very much needed if we can overlap the streams to save time.

Topic		Replies	Views
How to Launch Cuda kernel in different processes CUDA Programming and Performance	8	3706	November 6, 2018
Overlapping kernel computing with stream per (CPU) thread, slow kernel launches CUDA Programming and Performance	10	3660	October 21, 2017
Concurrent kernel execution without stream CUDA Programming and Performance	7	2454	December 28, 2016
low concurrency and low kernel utilization, but kernels are filled. CUDA Programming and Performance	6	1407	November 18, 2018
My streams are not running concurrently CUDA Programming and Performance	7	1768	March 6, 2018
multi task parallelization with cuda streams ? CUDA Programming and Performance	7	1449	September 14, 2017
Why kernel executions in different streams are not parallel? CUDA Programming and Performance	4	2569	April 29, 2019
Cuda Streams for Concurrent Kernel Calls CUDA Programming and Performance	1	2237	October 26, 2016
8x GPU app profiles parallel GPU kernel exec in NVVP, but kernels exec serial from cmd line CUDA Programming and Performance	5	561	September 15, 2020
Kernel launch concurrency CUDA Programming and Performance	10	1802	December 11, 2014

How to effectively parallelize cuda kernel launches on CPU

Related topics