Why does CUDA MPS make cudaMalloc faster?

I am using a Pascal 40 GPU and tried CUDA MPS in order to run multiple CUDA contexts simultaneously.
I observed the following:

  1. cudaMalloc becomes much faster when the CUDA MPS daemon is running.
  2. Even with small kernels (kernels that seem to leave GPU resources free for another CUDA process), the processes do not seem to run any more concurrently than they do without CUDA MPS.

Could you please explain why these happen?

Thank you!

The first CUDA call in your program is likely to be faster when the GPU is already initialized. The CUDA MPS daemon may be keeping the GPU initialized. If the first CUDA call in your program is cudaMalloc, it may therefore appear to be faster.
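One way to check whether initialization is what you are measuring is to time two consecutive cudaMalloc calls in the same process. This is only a minimal sketch: the helper name timedMallocMs and the 1 MiB allocation size are my own choices for illustration, not anything from your program.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time one cudaMalloc of `bytes` bytes, in milliseconds.
// Returns a negative value if the allocation fails.
static double timedMallocMs(void **ptr, size_t bytes) {
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(ptr, bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, an arbitrary size for illustration
    void *a = nullptr;
    void *b = nullptr;

    // The first CUDA runtime call also pays for runtime and context setup.
    double first = timedMallocMs(&a, bytes);
    // The second call runs with the context already in place.
    double second = timedMallocMs(&b, bytes);

    std::printf("first  cudaMalloc: %.3f ms\n", first);
    std::printf("second cudaMalloc: %.3f ms\n", second);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If the first call is much slower than the second, and the gap shrinks when the MPS daemon is running, then the difference you observed is probably initialization overhead rather than the allocation itself.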

You can demonstrate kernel concurrency with CUDA MPS.
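A rough sketch of such a demonstration follows. It assumes a kernel that occupies a single thread and spins for a fixed number of clock cycles, so that it leaves almost the whole GPU free; the kernel name spinKernel and the default cycle count are made up for this example.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread busy-waits until `cycles` clock ticks pass.
__global__ void spinKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) {
        // busy-wait
    }
}

int main(int argc, char **argv) {
    // Cycle count is an arbitrary default; pass another value as argv[1].
    long long cycles = (argc > 1) ? std::atoll(argv[1]) : 1000000000LL;

    // Force context creation first so it is not included in the timing.
    cudaFree(0);

    auto t0 = std::chrono::steady_clock::now();
    spinKernel<<<1, 1>>>(cycles);
    cudaError_t err = cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("kernel wall time: %.1f ms\n", ms);
    return 0;
}
```

Start two copies of this program at the same time, once with the MPS daemon running (started with nvidia-cuda-mps-control -d) and once without it. If the kernels from the two processes overlap under MPS, each copy should report roughly the same time as a single copy running alone; without MPS I would expect the kernels from the two processes to be time-sliced, so at least one copy should report a noticeably longer time. If the GPU also drives a display, keep the cycle count small enough to stay under the watchdog timeout.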