Kernels executing concurrently in different streams do not behave as expected

I execute a series of kernels on multiple streams.

for (int i = 0; i < stream_count; i++) {
    kernel_1<<<grid, block, 0, streams[i]>>>(d_data);
    kernel_2<<<grid, block, 0, streams[i]>>>(d_data);
    kernel_3<<<grid, block, 0, streams[i]>>>(d_data);
    kernel_4<<<grid, block, 0, streams[i]>>>(d_data);
}
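For reference, a minimal self-contained version of this launch pattern might look like the sketch below. The kernel bodies, `stream_count`, and the `grid`/`block` configuration are assumptions, since the post does not show them; the loop structure matches the snippet above.

    #include <cuda_runtime.h>

    // Placeholder kernels: the real kernel bodies are not shown in the post.
    __global__ void kernel_1(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
    __global__ void kernel_2(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
    __global__ void kernel_3(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
    __global__ void kernel_4(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

    int main() {
        const int stream_count = 8;    // assumed value
        dim3 grid(32), block(256);     // assumed launch configuration
        float *d_data;
        cudaMalloc(&d_data, grid.x * block.x * sizeof(float));

        cudaStream_t streams[stream_count];
        for (int i = 0; i < stream_count; i++)
            cudaStreamCreate(&streams[i]);

        // Note: all streams write the same buffer here, as in the original
        // snippet; that is a data race unless the kernels are independent.
        for (int i = 0; i < stream_count; i++) {
            kernel_1<<<grid, block, 0, streams[i]>>>(d_data);
            kernel_2<<<grid, block, 0, streams[i]>>>(d_data);
            kernel_3<<<grid, block, 0, streams[i]>>>(d_data);
            kernel_4<<<grid, block, 0, streams[i]>>>(d_data);
        }
        cudaDeviceSynchronize();

        for (int i = 0; i < stream_count; i++)
            cudaStreamDestroy(streams[i]);
        cudaFree(d_data);
        return 0;
    }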

These kernels are identical except for the function name. I used nsys to analyze their execution, and the results were not what I expected.

But when I execute only one kernel in each stream, it behaves as expected.

To verify, I tried executing two kernels in each stream, and it behaved just as unexpectedly as with four kernels.

Is there some compiler mechanism involved here? It looks as if every kernel has to execute once in some stream before it can execute concurrently in the other streams. That does not seem to happen when only one kernel is launched per stream.

My platform is Ubuntu 22.04 with CUDA 12.2, and the GPU is an RTX 4070.

CUDA does not give any guarantees about overlapping kernels in independent streams.

Thanks for the reply. So there is no special mechanism at work here? I ask because I have seen others achieve simultaneous concurrency of kernels across streams.

There is no way that kernels issued into the same stream will be concurrent with each other. That is contrary to stream semantics.
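To illustrate the semantics (with hypothetical kernel and stream names, not taken from your code):

    // Same stream: strictly ordered.
    kernel_A<<<grid, block, 0, s>>>(d);
    kernel_B<<<grid, block, 0, s>>>(d);   // starts only after kernel_A finishes

    // Different streams: no ordering between A and B. The hardware *may*
    // run them concurrently if resources allow, but CUDA does not guarantee it.
    kernel_A<<<grid, block, 0, s1>>>(d1);
    kernel_B<<<grid, block, 0, s2>>>(d2);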

Certainly with respect to, e.g., streams 14, 15, and 16 in your first picture, that sort of behavior is what I would expect in the best case.

It’s not really clear what pattern you are expecting.

If your concern about the first picture has to do with stream 13, it might be that you are hitting some sort of initialization effect such as lazy loading. You could try that test case running with

CUDA_MODULE_LOADING=EAGER ./myapp
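If changing the environment is inconvenient, another option (my suggestion, not something your code already does) is to warm up each kernel with a dummy launch before the profiled region, so any one-time module-loading cost is paid up front:

    // Warm-up launches in the default stream, then synchronize,
    // so lazy module loading happens before the measured section.
    kernel_1<<<grid, block, 0, 0>>>(d_data);
    kernel_2<<<grid, block, 0, 0>>>(d_data);
    kernel_3<<<grid, block, 0, 0>>>(d_data);
    kernel_4<<<grid, block, 0, 0>>>(d_data);
    cudaDeviceSynchronize();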

Thank you. Your reply perfectly solved my problem. The picture below shows what I expected.

Thanks again.
