cudaStreamCreate and cudaStreamDestroy overhead?

I am thinking of introducing streams to my CUDA application. The easiest way of doing this is to create and destroy CUDA streams dynamically. The harder way would be to pre-create the streams I will need.

So my question is: what is the overhead of using streams dynamically? Is this a normal thing to do?

i would think that streams are generally rather static, as opposed to dynamic - they remain in use for the duration of execution of a functional block, and are really only changed when the application transitions between functional blocks, as the application adapts to the change in work to be done

in such a context stream overhead should be low, and i do not see how it can be significantly greater than other overhead, like memory allocation for instance

you could probably measure the time it takes to create and destroy a stream; but i would contest the premise that streams are created/destroyed ‘dynamically’
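a rough sketch of how one might measure this on the host, for what it is worth. the function name avgCreateDestroyMicros and the iteration count are made up for illustration; only the runtime calls (cudaFree, cudaStreamCreate, cudaStreamDestroy) are real API. it times the host-side cost of the calls only, and the numbers will depend on driver, OS and GPU:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: average host-side cost, in microseconds, of one
// cudaStreamCreate + cudaStreamDestroy pair. Returns a negative value
// if any CUDA call fails.
double avgCreateDestroyMicros(int iterations)
{
    // Force context creation up front so its one-time cost is not measured.
    if (cudaFree(0) != cudaSuccess) return -1.0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        cudaStream_t s;
        if (cudaStreamCreate(&s) != cudaSuccess) return -1.0;
        if (cudaStreamDestroy(s) != cudaSuccess) return -1.0;
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}

int main()
{
    double us = avgCreateDestroyMicros(1000);  // iteration count is arbitrary
    if (us < 0.0) {
        std::printf("a CUDA call failed\n");
        return 1;
    }
    std::printf("average create+destroy: %.2f us\n", us);
    return 0;
}
```

comparing the result against a similar loop around cudaMalloc/cudaFree would put the number in context.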

That’s the way I thought about it as well. However, when I think about it from a system point of view, explicitly creating or destroying streams is not really necessary.

This is because the device only cares about a single number that identifies the stream. However, if multiple applications are running, each sending kernels to the GPU and executing them, they should not use the same streams. Thus I think the cudaStreamCreate and cudaStreamDestroy operations are only there as a simple management system that allocates numbers (streams) to applications so that they don’t accidentally use the same streams. This suggests that dynamically creating streams would be an okay thing to do.

However this is just speculation on my part.

streams are of type cudaStream_t - an opaque handle to a structure rather than a plain number
i also do not think the structure contains redundant elements
hence, i would think that one requires at minimum a structure to manage a stream, not merely a number

one can have streams conditional on other streams (cudaStreamWaitEvent), flag when a stream is done, record its starting time and finishing time, etc
and a stream would likely map to a cuda context
this would make me doubt whether a number is sufficient to manage a stream - perhaps from the device’s vantage point, but certainly not from the host’s point of view
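to illustrate the kind of host-side bookkeeping meant here, a minimal sketch of one stream waiting on another via an event. the kernel addOne, the helper runDependentStreams and the CHECK macro are hypothetical names; cleanup on the error paths is left out to keep it short:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Bail out of the enclosing bool function on any CUDA error (sketch only).
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Runs addOne in `producer`, then makes `consumer` wait on an event recorded
// in `producer` before running addOne again. Returns true if every element
// ends up equal to 2, i.e. both kernels ran and did not race.
bool runDependentStreams(int n)
{
    int *d = nullptr;
    cudaStream_t producer, consumer;
    cudaEvent_t produced;

    CHECK(cudaMalloc(&d, n * sizeof(int)));
    CHECK(cudaMemset(d, 0, n * sizeof(int)));
    CHECK(cudaDeviceSynchronize());  // make sure the memset has finished
    CHECK(cudaStreamCreate(&producer));
    CHECK(cudaStreamCreate(&consumer));
    CHECK(cudaEventCreate(&produced));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    addOne<<<blocks, threads, 0, producer>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(produced, producer));         // marks "producer done"
    CHECK(cudaStreamWaitEvent(consumer, produced, 0));  // consumer waits for it
    addOne<<<blocks, threads, 0, consumer>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(consumer));

    std::vector<int> h(n);
    CHECK(cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost));
    bool ok = true;
    for (int v : h) if (v != 2) ok = false;

    CHECK(cudaEventDestroy(produced));
    CHECK(cudaStreamDestroy(consumer));
    CHECK(cudaStreamDestroy(producer));
    CHECK(cudaFree(d));
    return ok;
}

int main()
{
    std::printf("%s\n", runDependentStreams(1 << 16) ? "ok" : "failed");
    return 0;
}
```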

streams greatly enhance concurrency; i do not see how one would be able to forward issue work - memory copies and kernels - in a concurrent, asynchronous manner, by merely using numbers
even if streams can be seen as merely numbers, cuda would expect a stream, not a number
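as an illustration of forward-issuing work, here is a sketch of the usual chunked copy/kernel/copy pattern, one stream per chunk. the names scale, pipelineChunks and CHECK are hypothetical, and whether the copies and kernels actually overlap depends on the device; the host buffer is pinned because cudaMemcpyAsync generally needs pinned memory to run asynchronously:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Bail out of the enclosing bool function on any CUDA error (sketch only).
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Splits an array of ones into numStreams chunks and issues, per chunk and
// per stream: host-to-device copy, kernel, device-to-host copy. All work is
// queued before the host waits. Returns true if every element becomes 2.
bool pipelineChunks(int numStreams, int chunk)
{
    const int n = numStreams * chunk;
    float *h = nullptr, *d = nullptr;
    CHECK(cudaMallocHost(&h, n * sizeof(float)));  // pinned host memory
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams) CHECK(cudaStreamCreate(&s));

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;
    const size_t bytes = chunk * sizeof(float);

    for (int k = 0; k < numStreams; ++k) {
        float *hp = h + k * chunk;
        float *dp = d + k * chunk;
        CHECK(cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, streams[k]));
        scale<<<blocks, threads, 0, streams[k]>>>(dp, chunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, streams[k]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams to drain

    bool ok = true;
    for (int i = 0; i < n; ++i) if (h[i] != 2.0f) ok = false;

    for (auto &s : streams) CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return ok;
}

int main()
{
    std::printf("%s\n", pipelineChunks(4, 1 << 16) ? "ok" : "failed");
    return 0;
}
```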

@laan, Stephen Jones gave a “Tips and Tricks” talk at GTC15.

He noted that streams are cheap to create and destroy and this is actually a useful idiom:

S5530 - Featured Talk: Memory Management Tips, Tricks and Techniques