Question on Stream, Connection and Performance

CU_Steve · February 23, 2024, 4:46pm

I have no answers, but some related points:

In addition to the limits you mention, there is a limit on the number of resident grids per device.
See Table 18.
If you use dynamic parallelism, you can easily reach the grid limit using only one stream created by the host.
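
To make that concrete, here is a minimal sketch of how a single host-created stream can fan out into many grids. It assumes CUDA 12 or later (for the device-side `cudaStreamFireAndForget` named stream) and compilation with relocatable device code, e.g. `nvcc -rdc=true fanout.cu -lcudadevrt`. The kernel names `parent` and `child`, the helper `runFanOut`, and the count of 256 are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical child kernel: one tiny grid per slot.
__global__ void child(int *out, int slot) {
    if (threadIdx.x == 0) out[slot] = slot + 1;
}

// Each parent thread launches its own child grid. Fire-and-forget launches
// do not depend on previously launched grids, so they may run concurrently.
__global__ void parent(int *out, int numChildren) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numChildren) {
        child<<<1, 32, 0, cudaStreamFireAndForget>>>(out, i);
    }
}

// Launches one parent grid into ONE host stream and returns how many
// slots were written by a child grid.
int runFanOut(int numChildren) {
    int *dOut = nullptr;
    CHECK(cudaMalloc(&dOut, numChildren * sizeof(int)));
    CHECK(cudaMemset(dOut, 0, numChildren * sizeof(int)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    int threads = 128;
    int blocks = (numChildren + threads - 1) / threads;
    parent<<<blocks, threads, 0, stream>>>(dOut, numChildren);
    CHECK(cudaGetLastError());
    // The parent grid is not complete until all of its child grids are.
    CHECK(cudaStreamSynchronize(stream));

    std::vector<int> hOut(numChildren);
    CHECK(cudaMemcpy(hOut.data(), dOut, numChildren * sizeof(int),
                     cudaMemcpyDeviceToHost));
    int written = 0;
    for (int i = 0; i < numChildren; ++i) {
        if (hOut[i] == i + 1) ++written;
    }

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dOut));
    return written;
}

int main() {
    const int numChildren = 256;  // illustrative value only
    printf("%d of %d child grids completed\n",
           runFanOut(numChildren), numChildren);
    return 0;
}
```

The host only ever sees one stream here, yet the device is asked to schedule hundreds of independent grids, which is why the resident-grid limit can matter even with very few host streams.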

An approach I have been exploring recently:

- Create only a small number of streams on the host, e.g., one or two streams for each desired stream priority.
- Use a combination of host launches and dynamic parallelism to achieve the concurrency you want.
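
A rough sketch of what that could look like, not a description of any particular production setup: one host stream per priority level, with each host launch fanning out on the device. It assumes CUDA 12 or later and `nvcc -rdc=true ... -lcudadevrt`. The names `scaleChunk`, `fanOut`, and `runTwoPriorities`, and the buffer sizes, are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical worker: scales one chunk of a buffer.
__global__ void scaleChunk(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host-launched parent: one independent child grid per chunk.
__global__ void fanOut(float *data, int chunks, int chunkLen, float factor) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < chunks) {
        scaleChunk<<<(chunkLen + 127) / 128, 128, 0, cudaStreamFireAndForget>>>(
            data + (size_t)c * chunkLen, chunkLen, factor);
    }
}

// Returns true if both buffers hold the expected values afterwards.
bool runTwoPriorities(int chunks, int chunkLen) {
    // Numerically lower values mean higher priority.
    int least = 0, greatest = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));

    cudaStream_t lowStream, highStream;
    CHECK(cudaStreamCreateWithPriority(&lowStream, cudaStreamNonBlocking, least));
    CHECK(cudaStreamCreateWithPriority(&highStream, cudaStreamNonBlocking, greatest));

    size_t n = (size_t)chunks * chunkLen;
    std::vector<float> host(n, 1.0f);
    float *dLow = nullptr, *dHigh = nullptr;
    CHECK(cudaMalloc(&dLow, n * sizeof(float)));
    CHECK(cudaMalloc(&dHigh, n * sizeof(float)));

    // Copies are issued in the same stream as the kernel that consumes them,
    // so stream order guarantees the data arrives first.
    CHECK(cudaMemcpyAsync(dLow, host.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice, lowStream));
    CHECK(cudaMemcpyAsync(dHigh, host.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice, highStream));

    int threads = 128;
    int blocks = (chunks + threads - 1) / threads;
    fanOut<<<blocks, threads, 0, lowStream>>>(dLow, chunks, chunkLen, 2.0f);
    fanOut<<<blocks, threads, 0, highStream>>>(dHigh, chunks, chunkLen, 3.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(lowStream));
    CHECK(cudaStreamSynchronize(highStream));

    std::vector<float> low(n), high(n);
    CHECK(cudaMemcpy(low.data(), dLow, n * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(high.data(), dHigh, n * sizeof(float), cudaMemcpyDeviceToHost));

    bool ok = true;
    for (size_t i = 0; i < n; ++i) {
        if (low[i] != 2.0f || high[i] != 3.0f) ok = false;
    }

    CHECK(cudaFree(dLow));
    CHECK(cudaFree(dHigh));
    CHECK(cudaStreamDestroy(lowStream));
    CHECK(cudaStreamDestroy(highStream));
    return ok;
}

int main() {
    printf("%s\n", runTwoPriorities(8, 1000) ? "ok" : "mismatch");
    return 0;
}
```

On devices that do not support stream priorities, `cudaDeviceGetStreamPriorityRange` reports the same value for both bounds, so the two streams simply behave alike. How the scheduler treats the child grids relative to their parent's stream priority is not something this sketch tries to demonstrate.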

In addition to the links you posted, beware of Increased time to synchronize….
