How many streams? Maximum number of streams

Hi,

is there a limit on how many streams I can use in an application?
If I increase the number of streams in the simpleStreams example from the CUDA SDK to 256, the test fails.

I want to use several hundred streams in my application. Is that possible? And how expensive is stream creation?

Thanks in advance.

I’m curious about this too. Anyone with working experience on this?

I also want to know the limit on CUDA streams

I have seen applications that create ~50 streams successfully. As far as I know there is no published or specified limit.

For the typical use cases I am familiar with, you don't need more than ~5 streams per device to get maximum copy/compute overlap and maximum throughput.
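A minimal sketch of what copy/compute overlap with a handful of streams looks like (the kernel, sizes, and stream count here are illustrative, not from the thread; it assumes a device that supports concurrent copy and execution, and pinned host memory, which async copies require):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int nStreams = 4;                    // a few streams are usually enough
    const int chunk = 1 << 20;                 // elements per stream
    const size_t bytes = chunk * sizeof(float);

    float *h, *d;
    cudaMallocHost(&h, nStreams * bytes);      // pinned memory, required for true async copies
    cudaMalloc(&d, nStreams * bytes);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream's H2D copy, kernel, and D2H copy can overlap with
    // the other streams' work.
    for (int s = 0; s < nStreams; ++s) {
        float *hp = h + s * chunk, *dp = d + s * chunk;
        cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```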

Stream creation does use up resources, although the details are not published/specified. Stream creation also takes some time; this can be measured for your setting (GPU, CUDA version, OS) with a profiler, using an API trace mode. On a recent test, cudaStreamCreate took ~10us on CentOS 7, Tesla V100, CUDA 11.4. cudaStreamDestroy took a similar amount of time.
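If you'd rather measure it directly than use a profiler, a rough host-side timing sketch (the stream count of 256 and the use of `cudaFree(0)` to force context creation up front are my choices, not from the thread):

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    cudaFree(0);  // force CUDA context creation so its cost isn't charged to the first stream
    const int N = 256;
    cudaStream_t s[N];

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) cudaStreamCreate(&s[i]);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) cudaStreamDestroy(s[i]);
    auto t2 = std::chrono::steady_clock::now();

    printf("create: %.1f us/stream, destroy: %.1f us/stream\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count() / N,
           std::chrono::duration<double, std::micro>(t2 - t1).count() / N);
    return 0;
}
```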

Additional recommendations about stream usage can be found in the CUDA Concurrency section of this online training series.

nice material

You can create a very large number of streams. However, if you create more streams than connections (default = 8), streams will alias to connections, which can result in false dependencies. The number of connections can be set from 1-32 using the environment variable CUDA_DEVICE_MAX_CONNECTIONS. Connections are much more expensive than streams.
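One way to raise the connection count from inside the program rather than the shell (a sketch assuming a POSIX system, where `setenv` is available; the key point is that the variable must be set before the CUDA context is created or it has no effect):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    // Must happen before the first CUDA runtime call, because the
    // variable is read at context creation time. Valid range: 1-32.
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);

    cudaFree(0);  // triggers context creation; the setting is now in effect
    // ... create streams and launch work as usual ...
    return 0;
}
```

Equivalently, set `CUDA_DEVICE_MAX_CONNECTIONS=32` in the environment before launching the application.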

@Greg : Is there some reservation of connections by contexts, or could a single context theoretically hog all the connections?

This is the per context limit. The device limit is much higher so a single context cannot consume all connections.


what is a “connection” ?

from here:

CUDA_DEVICE_MAX_CONNECTIONS

Sets the number of compute and copy engine concurrent connections (work queues) from the host to each device of compute capability 3.5 and above.

When you issue work to the GPU (e.g. a kernel launch or cudaMemcpy operation) it uses a “connection” to get to the GPU.

Does that mean that only up to CUDA_DEVICE_MAX_CONNECTIONS streams can be truly independent?

what greg said ^^