Parallelization of kernels without MPS

Dear all,
While experimenting with parallel kernels and processes under CUDA MPS, I ran into the following situation.

I developed a process with a CUDA kernel that executes in 5 seconds on a V100 GPU. Now I want to run the same process 10 times in parallel, that is, 10 identical processes (each with the same 5-second kernel) running at once.

Without CUDA MPS, the processes seem to run simultaneously (as observed in nvidia-smi), but each one takes 28 seconds (more than the execution time of a single process):
1st process: 28 sec
2nd process: 28 sec
.
.
.

With CUDA MPS, the processes run simultaneously and each one takes 20 seconds:
1st process: 20 sec
2nd process: 20 sec
.
.
.

Note that these times cover only the kernel execution time.
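
For anyone who wants to repeat this kind of experiment, here is a minimal sketch of a launcher that starts several copies of a program at once and reports how long each copy took. It is an illustration only, not the script used for the numbers above: the function names are made up for this example, and it measures wall-clock time per process (including process startup and CUDA context creation), not kernel-only time.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def timed_run(cmd):
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def run_in_parallel(cmd, copies):
    """Start `copies` instances of `cmd` at once; return the per-process wall times."""
    # One worker thread per copy, so all child processes are launched together.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(timed_run, [cmd] * copies))


if __name__ == "__main__" and len(sys.argv) > 1:
    # The command to launch is taken from the command line; nothing is assumed here.
    for i, seconds in enumerate(run_in_parallel(sys.argv[1:], 10), start=1):
        print(f"process {i}: {seconds:.1f} sec")
```

It could be invoked as `python launch.py ./my_kernel_app`, where both `launch.py` and `./my_kernel_app` are hypothetical names standing in for this script and the CUDA program under test. Running it once with the MPS server stopped and once with it started gives the two cases compared above.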

My question is:
Why do the processes seem to run simultaneously without CUDA MPS? I was expecting obvious serialization, like:
1st process: 5 sec
2nd process: 10 sec
.
.
.

This is actually the output I got on a much smaller GPU.
Is it because of Hyper-Q? Or do we benefit from Hyper-Q only when the CUDA MPS server is enabled?

Thank you very much in advance!

The interprocess scheduler has changed over time. Some evidence of this (even in the non-MPS case) can be gleaned by reading the CUDA MPS manual.

On Volta, the scheduler is a time-slicing scheduler, as opposed to a round-robin-to-completion context switcher.

Your test case provides evidence of this. It is also described here, in the UPDATE comments on the answer:

https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app/34711344#34711344

As a result of the time-slicing scheduler, the processes appear to run concurrently (at least at some level of inspection/granularity). However, in the non-MPS case, only one process has access to the GPU at any given instant of time, so the net effect is still that your processes are serializing in some fashion.
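
To make the difference concrete, here is a toy model of the two scheduling styles. It is only a sketch under stated assumptions: the 500 ms time slice is an arbitrary value chosen for illustration (the real slice length is not given in this thread), context-switch cost is ignored, and the model does not try to reproduce the 28-second figure reported above. It only shows why per-process finish times bunch together under time slicing even though a single process owns the GPU at any instant.

```python
def run_to_completion(work_ms, n):
    """Each process keeps the GPU until its kernel finishes, then the next one runs."""
    finish, clock = [], 0
    for _ in range(n):
        clock += work_ms
        finish.append(clock)
    return finish


def time_sliced(work_ms, n, quantum_ms):
    """Processes take turns in fixed slices; only one runs at any instant."""
    remaining = [work_ms] * n
    finish = [0] * n
    clock = 0
    while any(r > 0 for r in remaining):
        for i in range(n):
            if remaining[i] > 0:
                step = min(quantum_ms, remaining[i])
                clock += step
                remaining[i] -= step
                if remaining[i] == 0:
                    finish[i] = clock  # this process just completed its kernel
    return finish


if __name__ == "__main__":
    # 10 processes, each needing 5000 ms of GPU time; 500 ms slice is an assumption.
    print("run to completion:", run_to_completion(5000, 10))
    print("time sliced:      ", time_sliced(5000, 10, 500))
```

In this model the run-to-completion finish times are 5000, 10000, and so on up to 50000 ms, which is the staircase the original poster expected. The time-sliced finish times all fall between 45500 and 50000 ms, so every process looks like it took about as long as all the others, while the total amount of work done per unit of time is unchanged.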

Thank you for your response!
Isn’t it because of Hyper-Q?

As I understand it, Hyper-Q enables the concurrent execution of many kernels (maximum 32) without using any MPS service. In that case, MPS improves the resource management of the concurrent kernels and lets us maximize the number of concurrent kernels (maximum 48).

Am I correct?

No, that is not correct. Hyper-Q does not enable the concurrent execution of kernels from separate processes. MPS is required for that.

The reason is as I stated.

Thank you!

In the Hyper-Q documentation released by NVIDIA I found this:

“Hyper-Q enables multiple CPU threads or processes to launch work on a single GPU
simultaneously, thereby dramatically increasing GPU utilization and slashing CPU idle times.”

Doesn’t it mean that Hyper-Q enables the concurrent execution of kernels from separate processes?

It does not. MPS is required to witness kernel concurrency when the kernels are launched from separate processes.

MPS is only supported on devices of compute capability 3.5 or higher, which coincidentally are the devices that include Hyper-Q. Therefore, if you wanted to say “Hyper-Q is a necessary condition for concurrent kernel execution from separate processes”, I would agree with you. However, if you wanted to say “Hyper-Q is a sufficient condition for concurrent kernel execution from separate processes”, I would not agree with you.