I was experimenting with CUDA MPS (Multi-Process Service), and it did improve performance when a single task doesn’t saturate the GPU.
I’m wondering: is there a performance-related reason MPS is not the default when running multiple processes?
Or is it just convention to give all the resources to one process so that it finishes ASAP?
Also, what’s the difference between using MPS to share the hardware and using multiple CUDA streams, in terms of scheduling and workload balancing?
Thanks!