Section 5.1 of the MPS documentation (https://docs.nvidia.com/deploy/mps/index.html) says that to enable MPS, we set the GPU to EXCLUSIVE_PROCESS compute mode. After doing so, we see that kernels launched by the same user are distributed across multiple SMs and run in parallel. In default mode (without MPS), they run in a time-sliced manner.
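For reference, this is roughly the setup sequence we used, following section 5.1. The GPU index (0) and single-GPU assumption are just for illustration:

```shell
# Run as root: put GPU 0 into EXCLUSIVE_PROCESS compute mode,
# so all CUDA work on the device goes through the MPS server.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon; CUDA applications started
# afterwards connect to it as MPS clients automatically.
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# ... launch the CUDA applications under test here ...

# Shut MPS down and restore the default compute mode when done.
echo quit | nvidia-cuda-mps-control
nvidia-smi -i 0 -c DEFAULT
```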
However, when we try launching kernels from different users, one user's kernel is not launched until the other user's kernel has finished. This is where my understanding of MPS on Turing is shaky. Since Turing is similar to Volta in this respect (https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf), I believed that "MPS clients" meant jobs launched by different users, but that does not appear to be the case: those jobs run serially. When MPS is disabled, the jobs from different users are still accepted, but, as answered in another discussion (https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app), they are time-shared rather than run concurrently.
So, in effect, with MPS on the Turing architecture, jobs from two different users cannot run in parallel even when resources are available. Will jobs from multiple users always be serialized?
Thanks in advance.