Maximum number of streams?

Hi,

is there a limitation on how many streams I can use in an application?
If I increase the number of streams in the simpleStreams example from the CUDA SDK to 256, the test fails.

I want to use several hundred streams in my application. Is that possible? And how expensive is stream creation?

Thanks in advance.

I’m curious about this too. Does anyone have hands-on experience with this?

I also want to know the limit on CUDA streams.

I have seen applications that create ~50 streams successfully. As far as I know there is no published or specified limit.

For the typical use cases I am familiar with, you do not need more than ~5 streams per device to get maximum copy/compute overlap and maximum throughput.
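
For illustration, here is a minimal sketch of that pattern (the scale kernel, chunk size, and stream count are hypothetical, chosen only for the example): each stream works on its own chunk, so one stream’s H2D copy can overlap another stream’s kernel and D2H copy.

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {  // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int nStreams = 4, chunk = 1 << 20;
    float *h, *d;
    // Pinned host memory is required for cudaMemcpyAsync to actually overlap.
    cudaMallocHost(&h, nStreams * chunk * sizeof(float));
    cudaMalloc(&d, nStreams * chunk * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

    // Issue copy-kernel-copy per stream; the chunks are independent, so the
    // streams' work can overlap.
    for (int i = 0; i < nStreams; ++i) {
        float *hp = h + i * chunk, *dp = d + i * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```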

Stream creation does use up resources, although the details are not published/specified. Stream creation also takes some time; this can be measured for your configuration (GPU, CUDA version, OS) with a profiler in API trace mode. In a recent test, cudaStreamCreate took ~10 us on CentOS 7 with a Tesla V100 and CUDA 11.4. cudaStreamDestroy took a similar amount of time.
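
As a rough sketch, the cost can also be measured without a profiler by timing a batch of creations from the host (numbers will vary with GPU, driver, and OS):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    cudaFree(0);  // force context creation so it isn't counted in the timing
    const int N = 256;
    static cudaStream_t s[N];

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) cudaStreamCreate(&s[i]);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) cudaStreamDestroy(s[i]);
    auto t2 = std::chrono::steady_clock::now();

    printf("create: %.1f us/stream, destroy: %.1f us/stream\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count() / N,
           std::chrono::duration<double, std::micro>(t2 - t1).count() / N);
    return 0;
}
```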

Additional recommendations about stream usage can be found in the CUDA Concurrency section of this online training series.

nice material

You can create a very large number of streams. However, if you create more streams than connections (default = 8), then streams will alias to connections, which can result in false dependencies. The number of connections can be set from 1 to 32 using the environment variable CUDA_DEVICE_MAX_CONNECTIONS. Connections are much more expensive than streams.
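
If it helps, a sketch of how the variable is applied: it must be in the environment before the CUDA context is created, so either set it from the shell when launching the process, or (on POSIX systems) call setenv before the first CUDA call:

```cpp
// From the shell:
//   CUDA_DEVICE_MAX_CONNECTIONS=32 ./my_app
//
// Or in code, before the first CUDA runtime call creates the context:
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);  // valid range: 1-32 (POSIX)
    cudaFree(0);  // first runtime call; the context picks up the setting here
    // ... create streams and launch work as usual ...
    return 0;
}
```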

@Greg: Is there some reservation of connections by contexts, or could a single context theoretically hog all the connections?

This is the per-context limit. The device limit is much higher, so a single context cannot consume all connections.

What is a “connection”?

from here:

CUDA_DEVICE_MAX_CONNECTIONS

Sets the number of compute and copy engine concurrent connections (work queues) from the host to each device of compute capability 3.5 and above.

When you issue work to the GPU (e.g. a kernel launch or cudaMemcpy operation) it uses a “connection” to get to the GPU.

Does that mean that only up to MAX_CONNECTIONS streams can be truly independent?

what greg said ^^

Can we get the per-context and/or per-device numbers using the Driver API? I can’t seem to find them among the attributes.

There is no practical limit to the number of streams you can create (at least 1000). However, there is a limit to the number of streams that can effectively run concurrently.
On Fermi, the architecture supports 16 concurrent kernel launches, but there is only one connection from the host to the GPU. So even if you have 16 CUDA streams, they all end up in a single hardware queue. This can create false data dependencies and limit the amount of concurrency one can easily obtain.
On Kepler, the number of connections between the host and the GPU is 32 (instead of one on Fermi). With the new Hyper-Q technology, it is now easier to keep the GPU busy with concurrent work.
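
A minimal sketch to see this on a profiler timeline (the spin kernel is hypothetical, just long enough to make overlap visible): launch one small grid per stream and compare the timeline under different CUDA_DEVICE_MAX_CONNECTIONS settings.

```cpp
#include <cuda_runtime.h>

// Busy-wait kernel, long enough that overlap is visible in a profiler.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int nStreams = 16;
    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

    // One small grid per stream; run under Nsight Systems and vary
    // CUDA_DEVICE_MAX_CONNECTIONS to see how many launches actually overlap.
    for (int i = 0; i < nStreams; ++i)
        spin<<<1, 32, 0, s[i]>>>(100000000LL);
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
    return 0;
}
```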

I am not aware of a method to query (a) the current setting, (b) the context maximum number of connections, or (c) the device maximum number of connections. If you have a business use case for this information I recommend you file a request for enhancement (RFE) through the bug system.

Hi Greg,

I am also interested in sergeyn’s question and hope you can explain it:
Does that mean that only up to MAX_CONNECTIONS streams can be truly independent?

Beyond MAX_CONNECTIONS streams, there is a chance of a false dependency in the input queue that serializes execution of the command buffer.

There are multiple methods to increase concurrency; however, all methods are limited by the current Compute Work Distributor, which focuses on rasterizing one grid (kernel) before moving to the next, and by the GPU’s maximum number of resident grids per device.

When running the MPS server, the Compute Work Distributor can simultaneously launch work from different processes on different groups of SMs.

CDPv2 and CUDA graphs are good methods to increase concurrency from a single launch stream.
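
For reference, a minimal sketch of the graph approach using stream capture (the step kernel and launch parameters are placeholders): record the work once, then replay it with a single cudaGraphLaunch per iteration.

```cpp
#include <cuda_runtime.h>

__global__ void step(float *d, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Capture a chunk of dependent work into a graph.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    step<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    step<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaStreamEndCapture(s, &graph);

    // CUDA 11 signature; on CUDA 12+ this is cudaGraphInstantiate(&exec, graph, 0).
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Replay: one launch submits the whole captured graph.
    for (int i = 0; i < 100; ++i) cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```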
