Hi, I’m a little confused about the way MPS and multi-stream work. I keep seeing “concurrency” used to describe both MPS and multi-stream, but I don’t know its precise meaning.
The question is: do MPS and multi-stream execute kernels in parallel, i.e., are compute resources partitioned among different kernels so that those kernels actually compute simultaneously? Or is it just a form of kernel switching, i.e., if one kernel is waiting on a memory fetch, the next kernel is scheduled for computation?
Streams only have meaning relative to a particular CPU process. Streams are an important part of the CUDA methodology to arrange for asynchronous concurrency within a process. In order to witness concurrent kernel execution from a single process, stream usage is necessary. This topic is covered in many places and there is a CUDA concurrentKernels sample code you may wish to study.
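As a rough illustration of the single-process case (this is a minimal sketch, not the concurrentKernels sample itself), the program below launches the same small kernel four times, first into one stream and then into four separate streams, and compares the wall-clock time. The kernel `spin`, the helper `timeLaunches`, and the cycle count are hypothetical names and values chosen for the illustration. If the second time comes out noticeably shorter, the kernels overlapped, but that result is not guaranteed on every device.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel: loops until roughly `cycles` clock ticks have
// elapsed, so each launch runs long enough for any overlap to be measurable.
__global__ void spin(long long cycles) {
    const long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launch one tiny kernel into each stream in `streams` and return the elapsed
// host wall time in milliseconds, measured around a full device sync.
static double timeLaunches(cudaStream_t* streams, int n, long long cycles) {
    cudaDeviceSynchronize();
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        spin<<<1, 32, 0, streams[i]>>>(cycles);
    }
    cudaDeviceSynchronize();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 4;
    const long long cycles = 100000000LL;  // arbitrary; tune for your GPU clock

    cudaStream_t shared;
    cudaStream_t same[n];      // every entry refers to the same stream
    cudaStream_t distinct[n];  // one stream per kernel
    cudaStreamCreate(&shared);
    for (int i = 0; i < n; ++i) {
        same[i] = shared;
        cudaStreamCreate(&distinct[i]);
    }

    spin<<<1, 32>>>(1000);  // warm-up launch so context setup is not timed
    cudaDeviceSynchronize();

    const double oneStream = timeLaunches(same, n, cycles);
    const double manyStreams = timeLaunches(distinct, n, cycles);

    const cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("one stream:   %.1f ms\n", oneStream);
    std::printf("four streams: %.1f ms\n", manyStreams);

    for (int i = 0; i < n; ++i) cudaStreamDestroy(distinct[i]);
    cudaStreamDestroy(shared);
    return 0;
}
```

This should build with something like `nvcc -o streams streams.cu` (the file name is arbitrary). Kernels issued into the same stream always run in issue order, so the first measurement is the serialized baseline. A timeline profiler will show more directly whether the kernels in the second measurement really overlapped.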
MPS is, as its name indicates, an inter-process mechanism. It helps 2 or more processes that wish to share a GPU do so more efficiently. One of the features of MPS is that it may, under some circumstances, allow kernels from separate processes to run on the GPU concurrently. The CUDA MPS docs (a simple Google search will find them for you) cover more details.
From a usage standpoint, these two features are mostly orthogonal. You can use streams with or without MPS; streams do not depend on MPS for the features they provide. You can use MPS with or without streams (in each process); MPS does not depend on process stream usage for the features it provides.
Neither mechanism guarantees kernel concurrency. They are necessary but not sufficient conditions to witness kernel concurrency in each of their respective scenarios (single process, multi-process).
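One of the other conditions can be checked directly. The sketch below (an illustration only, not an exhaustive test) queries whether the current device reports support for concurrent kernel execution and how many SMs it has. Even when the device reports support, a kernel that already occupies all the SMs leaves no room for another kernel to run beside it, so this check is also necessary rather than sufficient.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }

    int concurrentKernels = 0;  // 1 if the device can run kernels concurrently
    int smCount = 0;            // number of streaming multiprocessors
    cudaDeviceGetAttribute(&concurrentKernels, cudaDevAttrConcurrentKernels, device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);

    std::printf("device %d: concurrent kernels %s, %d SMs\n",
                device, concurrentKernels ? "supported" : "not supported", smCount);
    return 0;
}
```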