I have seen applications create ~50 streams successfully. As far as I know, there is no published or specified limit.
For the typical use cases I am familiar with, no more than ~5 streams per device are needed to achieve maximum copy/compute overlap and maximum throughput.
Stream creation does use up resources, although the details are not published/specified. Stream creation also takes some time, which you can measure for your own setup (GPU, CUDA version, OS) with a profiler using an API trace mode. In a recent test, cudaStreamCreate took ~10 us on CentOS 7 with a Tesla V100 and CUDA 11.4; cudaStreamDestroy took a similar amount of time.
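If you'd rather not use a profiler, host-side timers around the API calls give a rough number as well. A minimal sketch (the iteration count and averaging scheme are my own choices, not part of the measurement described above):

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const int kIters = 100;          // number of create/destroy pairs to average over
    cudaStream_t streams[kIters];

    // Touch the device first so context creation is not counted in the timing.
    cudaFree(0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i)
        cudaStreamCreate(&streams[i]);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i)
        cudaStreamDestroy(streams[i]);
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::duration<double, std::micro>;
    printf("cudaStreamCreate:  %.1f us/call\n", us(t1 - t0).count() / kIters);
    printf("cudaStreamDestroy: %.1f us/call\n", us(t2 - t1).count() / kIters);
    return 0;
}
```

Build with `nvcc`; expect the numbers to vary with GPU, driver, and OS.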
You can create a very large number of streams. However, if you create more streams than connections (default = 8), streams will alias onto connections, which can result in false dependencies. The number of connections can be set from 1 to 32 using the environment variable CUDA_DEVICE_MAX_CONNECTIONS. Connections are much more expensive than streams.
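Raising the connection count is done entirely through the environment, before the application creates its CUDA context. A sketch, where `./my_app` stands in for whatever CUDA application you are running:

```shell
# Raise the number of hardware connections (work queues) from the
# default of 8; valid values are 1-32. Must be set before context creation.
export CUDA_DEVICE_MAX_CONNECTIONS=32
./my_app
```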
There is no practical limit to the number of streams you can create (at least 1000). However, there is a limit to how many streams can effectively execute concurrently.
On Fermi, the architecture supports 16 concurrent kernel launches, but there is only one connection from the host to the GPU. So even if you have 16 CUDA streams, they all end up funneled through a single hardware queue. This can create false data dependencies and limit the amount of concurrency you can easily obtain.
With Kepler, the number of connections between the host and the GPU is now 32 (instead of one on Fermi). With the new Hyper-Q technology, it is now easier to keep the GPU busy with concurrent work.
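As a rough sketch of how Hyper-Q is exercised: launch independent kernels into separate streams, and on Kepler and later each stream can map to its own hardware queue and run concurrently. The kernel, sizes, and stream count below are illustrative, not from any particular application:

```cuda
#include <cuda_runtime.h>

__global__ void busy(float *x) {       // small dummy kernel to occupy the GPU
    for (int i = 0; i < (1 << 16); ++i)
        x[threadIdx.x] += 1.0f;
}

int main() {
    const int kStreams = 8;            // stays within the default 8 connections
    cudaStream_t s[kStreams];
    float *buf;
    cudaMalloc(&buf, kStreams * 32 * sizeof(float));

    for (int i = 0; i < kStreams; ++i)
        cudaStreamCreate(&s[i]);

    // Each launch goes into its own stream; with Hyper-Q these can be
    // serviced by independent hardware queues and overlap on the device.
    for (int i = 0; i < kStreams; ++i)
        busy<<<1, 32, 0, s[i]>>>(buf + i * 32);

    cudaDeviceSynchronize();
    for (int i = 0; i < kStreams; ++i)
        cudaStreamDestroy(s[i]);
    cudaFree(buf);
    return 0;
}
```

A timeline view in a profiler (e.g. Nsight Systems) is the easiest way to confirm the kernels actually overlapped.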
I am not aware of a method to query (a) the current setting, (b) the per-context maximum number of connections, or (c) the per-device maximum number of connections. If you have a business use case for this information, I recommend filing a request for enhancement (RFE) through the bug system.