Cannot see concurrent kernel execution by stream

I am trying to get concurrent execution of independent kernels using cudaStream, but I never see any overlap.

So, as a sanity check, I compiled some of the example code provided in “Professional CUDA C Programming”.

Unlike what is shown in the book, I cannot see any concurrency between independent kernels.

I am using nvvp to check this, and here is the code.

// Depth-first ordering
for(int i=0; i<n_Streams; i++)
{
    kernel_1<<<grid, block, 0, streams[i]>>>();
    kernel_2<<<grid, block, 0, streams[i]>>>();
    kernel_3<<<grid, block, 0, streams[i]>>>();
    kernel_4<<<grid, block, 0, streams[i]>>>();
}

// Breadth-first ordering
for(int i=0; i<n_Streams; i++)
    kernel_1<<<grid, block, 0, streams[i]>>>();
for(int i=0; i<n_Streams; i++)
    kernel_2<<<grid, block, 0, streams[i]>>>();
for(int i=0; i<n_Streams; i++)
    kernel_3<<<grid, block, 0, streams[i]>>>();
for(int i=0; i<n_Streams; i++)
    kernel_4<<<grid, block, 0, streams[i]>>>();

Is there anything I should set up to get concurrency between independent kernels?
Thank you in advance.

(Attachments: beathe.png (breadth-first nvvp timeline), deapth.png (depth-first nvvp timeline))

Try running the CUDA concurrentKernels sample code. When using the profilers, make sure the option to profile concurrent kernels is enabled.
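
If the sample still does not show overlap, here is a minimal self-contained sketch along the same lines (the kernel name spin_kernel, the stream count, and the cycle count are made up for illustration, not taken from the sample). Each launch uses a single small block so that one kernel cannot occupy the whole device, and the kernel busy-waits long enough to be visible in the profiler.

#include <cuda_runtime.h>

// Busy-wait for roughly `cycles` clock cycles so the kernel is long enough
// to overlap with launches in other streams.
__global__ void spin_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    const int n_streams = 4;
    cudaStream_t streams[n_streams];

    for (int i = 0; i < n_streams; i++)
        cudaStreamCreate(&streams[i]);

    // One block of one thread per launch: the GPU is nowhere near saturated,
    // so any lack of overlap points at setup or profiler options instead.
    for (int i = 0; i < n_streams; i++)
        spin_kernel<<<1, 1, 0, streams[i]>>>(100000000LL);

    cudaDeviceSynchronize();

    for (int i = 0; i < n_streams; i++)
        cudaStreamDestroy(streams[i]);

    return 0;
}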

Since all of the kernels in each iteration of your depth-first example are launched into the same stream, kernels within the same iteration will not run concurrently with each other. Kernels from subsequent iterations may also fail to run concurrently if the GPU is still fully occupied by kernels from previous iterations.

In the attached images you cut off the timeline axis, so it is not possible to determine the kernel duration. From the images it can be seen that the CPU launch overhead exceeds the kernel duration, so you will not be able to achieve concurrent execution unless you increase the duration of the kernels.

The first recommendation is to increase the duration of each kernel. On Fermi through Volta, the CWD (compute work distributor) will fully distribute all thread blocks from one kernel before processing the next kernel (assuming all kernels are launched with equal priority). If a kernel launch saturates the GPU resources, then concurrency will only be observed at the tail of a kernel, as SM resources are freed.
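
As a rough sketch of that recommendation (the iteration count and the kernel signature are illustrative, not the book's code): give each toy kernel enough arithmetic that its runtime dwarfs the few-microsecond launch overhead, and launch it with a small grid so a single kernel does not fill every SM.

// Pad the toy kernel with enough work that its duration dominates the
// host-side launch overhead. The argument x is a runtime value, so the
// compiler cannot fold the loop away; the result is written out to keep it live.
__global__ void kernel_1(double x, double *out)
{
    double sum = 0.0;
    for (int i = 0; i < 1000000; i++)   // illustrative iteration count; tune for your GPU
        sum = sum + tan(x) * tan(x);

    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = sum;
}

// Launch with a small grid (e.g. one block per stream) so the compute work
// distributor does not fill the whole GPU with a single kernel's blocks:
// kernel_1<<<1, 32, 0, streams[i]>>>(0.1, d_out);   // d_out: a device buffer you allocate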