Unable to see effect of MPS

Hi all,

I’m trying to use MPS (Multi-Process Service) on my L4 GPU server, but I’m not seeing any noticeable benefit. I have two processes, each with a single thread, and each thread uses CUDA to run inference on a model. When I profile my application with and without MPS, the start and end times of the processes do not differ significantly. In other words, the two processes appear to run concurrently even without MPS.
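A minimal Python sketch of this kind of start/end-time comparison, under stated assumptions. It is not the actual application: `run_inference_placeholder` is a hypothetical CPU-only stand-in for the real CUDA inference loop, and all function names here are made up for illustration. The same script would be run once with the MPS daemon active and once without.

```python
import multiprocessing as mp
import time


def run_inference_placeholder(n_iters):
    # Hypothetical stand-in for the real CUDA inference loop (assumption).
    total = 0
    for i in range(n_iters):
        total += i * i
    return total


def worker(name, n_iters, queue):
    # Record wall-clock start and end times around the workload.
    start = time.time()
    run_inference_placeholder(n_iters)
    end = time.time()
    queue.put((name, start, end))


def overlap_fraction(a, b):
    """Fraction of the shorter interval that overlaps the other one."""
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shorter = min(a_end - a_start, b_end - b_start)
    return overlap / shorter if shorter > 0 else 0.0


def measure(n_iters=200_000):
    # Launch two single-threaded processes and collect their time spans.
    queue = mp.Queue()
    procs = [
        mp.Process(target=worker, args=(f"proc{i}", n_iters, queue))
        for i in range(2)
    ]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return {name: (start, end) for name, start, end in results}


if __name__ == "__main__":
    spans = measure()
    for name, (start, end) in sorted(spans.items()):
        print(f"{name}: ran for {end - start:.3f} s")
    print("overlap fraction:", overlap_fraction(spans["proc0"], spans["proc1"]))
```

One caveat with this style of measurement: overlapping wall-clock intervals look the same whether the GPU is time-slicing between the two processes or truly running their kernels concurrently, so comparing per-process throughput against a single-process baseline is likely to be more telling than start and end times alone.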

Could this be due to the GPU’s time-sliced scheduler? If so, how can I verify that time-slicing is taking place when MPS is not running, and that the GPU is context switching between the threads of the two processes?

Additionally, do you have any MPS profiling statistics on GPU utilization when running several concurrent inferences of the ResNet50 model? I am running inference with a ResNet-based model, and in my case the GPU utilization (as reported by nvidia-smi) reaches 100% shortly after I start the two threads.
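For reference, the utilization reading can be sampled programmatically rather than read off the screen. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` and `--format=csv,noheader,nounits` options; the helper names `parse_utilization` and `sample_utilization` are hypothetical.

```python
import subprocess

# Query GPU and memory utilization as bare CSV numbers, one line per GPU.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,utilization.memory",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    # Some fields can be reported as non-numeric (e.g. "[N/A]"); map to None.
    try:
        return int(field.strip())
    except ValueError:
        return None


def parse_utilization(csv_text):
    """Parse the CSV output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        gpu, mem = line.split(",")
        rows.append({"gpu_util_pct": _to_int(gpu), "mem_util_pct": _to_int(mem)})
    return rows


def sample_utilization():
    # Run nvidia-smi once and parse its output.
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)


if __name__ == "__main__":
    try:
        print(sample_utilization())
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi not available: {exc}")
```

Note that `utilization.gpu` reports the percentage of the sample period during which at least one kernel was executing, so a reading of 100% does not by itself show that the SMs are fully occupied.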

Could you also share any benchmark results for a ResNet-based model on an L4 GPU, if available?