I have a job with a number of CUDA kernels that can process data in either single-threaded or multi-threaded mode. In multi-threaded mode, each thread has its own CUDA stream, and all kernels and memory accesses are associated with that stream. There are no mutexes or locks in the code, so the threads are pretty much entirely independent of each other. The job doesn't make particularly efficient use of the GPU, as the workloads are small.
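To make the multi-threaded setup concrete, here is a simplified sketch of the pattern, not the actual code. The `scale` kernel, the buffer size, the iteration count, the use of pinned host memory, and the `cudaStreamNonBlocking` flag are all placeholder assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder kernel standing in for the real (small) workloads.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Each host thread owns one stream and issues all of its copies and
// kernel launches on that stream. No locks are shared between threads.
void worker(int n, int iterations) {
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    float* host = nullptr;
    float* dev = nullptr;
    cudaMallocHost(&host, n * sizeof(float));  // pinned memory (assumption)
    cudaMalloc(&dev, n * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    for (int it = 0; it < iterations; ++it) {
        cudaMemcpyAsync(dev, host, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, 2.0f);
        cudaMemcpyAsync(host, dev, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        // Wait only for this thread's stream, not the whole device.
        cudaStreamSynchronize(stream);
    }

    cudaFree(dev);
    cudaFreeHost(host);
    cudaStreamDestroy(stream);
}

int main() {
    const int num_threads = 10;  // matches the 10-thread case described here
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, 1 << 14, 100);
    for (auto& t : threads) t.join();
    return 0;
}
```

All threads in this sketch share one process and therefore one CUDA context, which is the main structural difference from running separate processes.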
Alternatively, I can run multiple concurrent single-threaded jobs processing the same data using nvidia-cuda-mps-server.
Comparing the two, as the number of threads/processes increases, the scaling with the mps-server is MUCH, MUCH better. Because the GPU is not terribly stressed by the job, I can run 10 concurrent processes with the mps-server in about the same time as 1. However, using the MT implementation with 10 threads, a job to process an equivalent amount of data takes about 6 times as long.
How is it that the nvidia-cuda-mps-server is so much better at time slicing the GPU than multiple independent CUDA streams running concurrently in a single process?