MPS-enabled process is slower than MPS-disabled process.

System Config:
NVIDIA TITAN Xp. Cuda compilation tools, release 9.0, V9.0.176

I ran two models (an RNN and a CNN) as two separate processes on a single GPU, 1) in the default compute mode (thread parallelization) and 2) with the Multi-Process Service (MPS) enabled.

1) shows a lower run time than 2). My understanding was that MPS enables kernel-level parallelism, and hence I'd expect 2) to be faster than 1). Can someone please let me know if I am missing something, and why I observe 1) to be faster than 2)? Is there some additional configuration that needs to be enabled for these modes?

Not my area of expertise. Robert Crovella probably can provide better insights. Two thoughts to ponder:

(1) If each process can keep the GPU busy by itself (seems likely for a deep-learning application), kernel parallelism cannot increase overall throughput.

(2) But MPS adds overhead (multiple processes share a resource, namely a GPU context; this requires coordination).
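
To put rough numbers on the two thoughts above, here is a toy back-of-the-envelope model in Python. It is only a sketch under loud assumptions, not a description of how MPS actually schedules work: each process is assumed to use a fixed fraction of the GPU when running alone, the non-MPS baseline is assumed to run the two processes back to back with no overlap, and the MPS case is assumed to overlap them perfectly up to full GPU capacity. The function name `best_case_speedup` and the `mps_overhead` parameter are made up for illustration, and the 5% overhead is a hypothetical value, not a measurement.

```python
def best_case_speedup(busy_a: float, busy_b: float, mps_overhead: float = 0.0) -> float:
    """Best-case speedup from overlapping two processes, under a toy model.

    busy_a, busy_b: fraction of the GPU (0..1) each process keeps busy when it
                    runs alone for one unit of time (assumed, not measured).
    mps_overhead:   fractional extra time charged to the overlapped run for
                    coordination (hypothetical value).
    """
    for value in (busy_a, busy_b):
        if not 0.0 < value <= 1.0:
            raise ValueError("busy fractions must be in (0, 1]")
    if mps_overhead < 0.0:
        raise ValueError("mps_overhead must be non-negative")

    # Baseline assumption: no overlap, so the two unit-length runs add up.
    time_without_overlap = 2.0

    # Overlap assumption: the combined run can be no shorter than one process
    # alone (1.0) and no shorter than the total work divided by GPU capacity.
    time_with_overlap = max(1.0, busy_a + busy_b) * (1.0 + mps_overhead)

    return time_without_overlap / time_with_overlap


# Compare processes that saturate the GPU with ones that leave it half idle.
for a, b in [(1.0, 1.0), (0.5, 0.5)]:
    for overhead in (0.0, 0.05):
        speedup = best_case_speedup(a, b, overhead)
        print(f"busy={a:.2f}/{b:.2f} overhead={overhead:.2f} speedup={speedup:.3f}")
```

In this model, two processes that each saturate the GPU gain nothing from overlap, so any nonzero coordination overhead pushes the ratio below 1, which would match the observation in the question. Overlap only pays off when each process leaves a good part of the GPU idle on its own.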