System Config:
NVIDIA TITAN Xp. CUDA compilation tools, release 9.0, V9.0.176
I ran two models (an RNN and a CNN) as two separate processes on a single GPU, 1) in the default compute mode (thread parallelization) and 2) with the Multi-Process Service (MPS) enabled.
1) shows a lower run time than 2). My understanding was that MPS enables kernel-level parallelism across processes, so I expected 2) to be faster than 1). Can someone please let me know what I am missing and why 1) turns out to be faster than 2)? Is there some additional configuration that needs to be enabled for either of these modes?
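To make the two runs directly comparable, it can help to launch both processes from one harness that starts them at the same moment and times them identically. The harness is then run once with the MPS control daemon stopped (default mode) and once after starting it with `nvidia-cuda-mps-control -d`. Below is a minimal Python sketch of such a harness. The commands in `COMMANDS` are hypothetical placeholders that only sleep; they stand in for however the RNN and CNN scripts are actually launched.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholders for the two model processes. Replace them with the
# real launch commands, e.g. [sys.executable, "train_rnn.py"] (names assumed).
COMMANDS = {
    "rnn": [sys.executable, "-c", "import time; time.sleep(0.2)"],
    "cnn": [sys.executable, "-c", "import time; time.sleep(0.2)"],
}


def run_timed(cmd):
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def run_concurrently(commands):
    """Start all commands at once; return per-process times and the total time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        # Each thread blocks on its own subprocess, so the processes overlap.
        futures = {name: pool.submit(run_timed, cmd) for name, cmd in commands.items()}
        per_process = {name: f.result() for name, f in futures.items()}
    total = time.perf_counter() - start
    return per_process, total


if __name__ == "__main__":
    per_process, total = run_concurrently(COMMANDS)
    for name, seconds in per_process.items():
        print(f"{name}: {seconds:.3f} s")
    print(f"total (both processes): {total:.3f} s")
```

Note that these are end-to-end process times, so they include interpreter startup and CUDA context creation; for short runs that overhead can be a noticeable part of the difference between the two modes.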