Cannot see concurrent kernel execution by stream

I am trying to get concurrent execution of independent kernels using cudaStream, but I never see any overlap.

So, as a sanity check, I compiled some of the example code provided in “Professional CUDA C Programming”.

Unlike what is shown in the book, I cannot see any concurrency between independent kernels.

I am using nvvp to check this, and here is the code.

// Depth-first ordering
for(int i=0; i<n_Streams; i++)
{
    kernel_1<<<grid, block, 0, streams[i]>>>();
    kernel_2<<<grid, block, 0, streams[i]>>>();
    kernel_3<<<grid, block, 0, streams[i]>>>();
    kernel_4<<<grid, block, 0, streams[i]>>>();
}

// Breadth-first ordering
for(int i=0; i<n_Streams; i++)
    kernel_1<<<grid, block, 0, streams[i]>>>();
for(int i=0; i<n_Streams; i++)
    kernel_2<<<grid, block, 0, streams[i]>>>();
for(int i=0; i<n_Streams; i++)
    kernel_3<<<grid, block, 0, streams[i]>>>();
for(int i=0; i<n_Streams; i++)
    kernel_4<<<grid, block, 0, streams[i]>>>();

Is there anything I should set up to get concurrency between independent kernels?
Thank you in advance.

(Attachments: beathe.png (breadth-first nvvp timeline), deapth.png (depth-first nvvp timeline))

Try running the CUDA concurrentKernels sample code. When using the profilers, make sure the option to profile concurrent kernels is enabled.
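
If the sample still does not show overlap, here is a minimal self-contained sketch along the same lines (the kernel name spin_kernel, the stream count, and the cycle count are made up for illustration, not taken from the sample). Each launch uses a single small block so that one kernel cannot occupy the whole device, and the kernel busy-waits long enough to be visible in the profiler.

#include <cuda_runtime.h>

// Busy-wait for roughly `cycles` clock cycles so the kernel is long enough
// to overlap with launches in other streams.
__global__ void spin_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    const int n_streams = 4;
    cudaStream_t streams[n_streams];

    for (int i = 0; i < n_streams; i++)
        cudaStreamCreate(&streams[i]);

    // One block of one thread per launch: the GPU is nowhere near saturated,
    // so any lack of overlap points at setup or profiler options instead.
    for (int i = 0; i < n_streams; i++)
        spin_kernel<<<1, 1, 0, streams[i]>>>(100000000LL);

    cudaDeviceSynchronize();

    for (int i = 0; i < n_streams; i++)
        cudaStreamDestroy(streams[i]);

    return 0;
}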

Since all of the kernels in each iteration of your depth-first example are launched into the same stream, kernels within the same iteration will not run concurrently with each other. Kernels from subsequent iterations may also fail to run concurrently if the GPU is still fully occupied by kernels from previous iterations.

In the attached images you cut off the timeline axis, so it is not possible to determine the kernel duration. From the images it can be seen that the CPU launch overhead exceeds the kernel duration, so you will not be able to achieve concurrent execution unless you increase the duration of the kernels.

The first recommendation is to increase the duration of each kernel. On Fermi through Volta, the CWD (compute work distributor) will fully distribute all thread blocks from one kernel before processing the next kernel (assuming all kernels are launched with equal priority). If a kernel launch saturates the GPU resources, then concurrency will only be observed at the tail of a kernel, as SM resources are freed.
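
As a rough sketch of that recommendation (the iteration count and the kernel signature are illustrative, not the book's code): give each toy kernel enough arithmetic that its runtime dwarfs the few-microsecond launch overhead, and launch it with a small grid so a single kernel does not fill every SM.

// Pad the toy kernel with enough work that its duration dominates the
// host-side launch overhead. The argument x is a runtime value, so the
// compiler cannot fold the loop away; the result is written out to keep it live.
__global__ void kernel_1(double x, double *out)
{
    double sum = 0.0;
    for (int i = 0; i < 1000000; i++)   // illustrative iteration count; tune for your GPU
        sum = sum + tan(x) * tan(x);

    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = sum;
}

// Launch with a small grid (e.g. one block per stream) so the compute work
// distributor does not fill the whole GPU with a single kernel's blocks:
// kernel_1<<<1, 32, 0, streams[i]>>>(0.1, d_out);   // d_out: a device buffer you allocate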