CUDA streams and multiple processes

My question is about streams when multiple processes use the same GPU for execution.

Let's assume the CUDA MPS service is turned on, so a context is shared between two processes, P1 and P2. To maximize the overlap of data transfers and compute, and to avoid calling cudaDeviceSynchronize() entirely, we start using streams in our code.

Q) How many streams are supported on the Maxwell, Pascal and Turing architectures? I am guessing this is linked to how many kernels can execute concurrently on the GPU. If so, what are those numbers for these architectures?

Q) I have seen code samples using these two ways of using streams:

  1. kernel<<< blocks, threads, bytes, 0 >>>(); // stream 0
  2. kernel<<< blocks, threads, bytes, stream1 >>>(); // where stream1 is a cudaStream_t object

Which one is the recommended way? Are they both the same? And if two processes both use stream number 0, does that mean the two processes will start queuing work onto the same stream?

Hi,

>> How many streams are supported on Maxwell, Pascal and Turing architecture?
[SKA] There is no fixed limit on the number of streams you can create; the practical limit is the GPU resources consumed by the work you launch into each stream. Streams give the programmer a way to break data transfers into chunks and overlap them with kernel execution (e.g., overlapping H2D/D2H copies with compute), which reduces the overall time.
Streams also allow multiple kernels to be launched concurrently, so GPU resource utilization can be maximized. How many kernels can actually run concurrently is a separate hardware limit ("maximum number of resident grids per device"), listed per compute capability in the CUDA C Programming Guide (on the order of 32 for Maxwell and up to 128 on Turing).
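A minimal sketch of that chunked-overlap pattern (the kernel `scale` and the sizes are illustrative, not from the original post): each chunk's H2D copy, kernel launch, and D2H copy are queued on one of a few streams, so copies in one stream can overlap compute in another. Note that cudaMemcpyAsync only overlaps when the host buffer is pinned.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {           // toy kernel: d[i] *= 2
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22, N_STREAMS = 4;
    const int CHUNK = N / N_STREAMS;

    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));         // pinned host memory is required
    cudaMalloc(&d, N * sizeof(float));             // for copies to truly overlap

    cudaStream_t s[N_STREAMS];
    for (int i = 0; i < N_STREAMS; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < N_STREAMS; ++i) {          // queue copy/kernel/copy per chunk
        int off = i * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(d + off, CHUNK);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();                       // wait once, at the very end

    for (int i = 0; i < N_STREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```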

>> I have seen code samples using these two ways of using streams
kernel<<< blocks, threads, bytes, 0 >>>(); // stream 0
[SKA] Passing 0 (a null pointer) as the stream argument refers to the default (NULL) stream. Any other integer literal is rejected at compile time, since only cudaStream_t objects (CUstream_st*) are accepted:

error: argument of type “int” is incompatible with parameter of type “CUstream_st *”

kernel<<< blocks, threads, bytes, stream1 >>>(); // where stream1 is a cudaStream_t object
[SKA] Concurrent (non-default) streams must be created with cudaStreamCreate() and released with cudaStreamDestroy(). The cudaStream_t handle lets you manage each stream individually: you can attach a stream callback to be notified when the work queued before it completes, which gives you finer-grained synchronization than a device-wide barrier.
As for the two-process question: streams are per-process (per-context) objects. Even with MPS enabled, stream 0 in P1 and stream 0 in P2 are distinct streams, so the two processes do not queue work onto each other's streams; MPS only lets their work share the GPU's execution resources.
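For illustration, a sketch of the second form with explicit stream management (the kernel body and the callback name `done` are made up for this example). It uses cudaLaunchHostFunc, the CUDA 10+ replacement for cudaStreamAddCallback, and cudaStreamSynchronize to block on just this stream rather than the whole device:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *d, int n) {          // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = (float)i;
}

void CUDART_CB done(void *userData) {              // host callback: runs after all
    std::printf("%s finished\n",                   // prior work in the stream
                (const char *)userData);           // has completed
}

int main() {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);                    // create a non-default stream

    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));
    kernel<<<4, 256, 0, stream1>>>(d, 1024);       // enqueue work on stream1
    cudaLaunchHostFunc(stream1, done, (void *)"stream1");

    cudaStreamSynchronize(stream1);                // wait on this stream only,
                                                   // not cudaDeviceSynchronize()
    cudaStreamDestroy(stream1);
    cudaFree(d);
    return 0;
}
```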

Related Blog: https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/