Difference between vGPU and CUDA MPS

Use Case

I have to deploy around 3 inference pipelines, running concurrently, on a single AWS g4dn.xlarge instance. The instance has a single NVIDIA Tesla T4.

Progress So Far

Thanks to Robert Crovella’s amazing answers here and here, I was able to test using MPS. I understand from those answers that many GPUs released after Kepler follow different scheduling rules, and hence the experiment may not be COMPLETELY replicable on future generations.
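For reference, MPS can be enabled with the standard two-step setup, sketched below via Python's subprocess (the same two commands can just as well be run directly in a shell as root; setting the GPU to EXCLUSIVE_PROCESS is the commonly recommended compute mode for MPS, not a hard requirement):

import subprocess

# Standard single-GPU MPS setup (the T4 is device 0 here); requires root privileges.
subprocess.run(["nvidia-smi", "-i", "0", "-c", "EXCLUSIVE_PROCESS"], check=True)
# Start the MPS control daemon; CUDA processes started afterwards are handled by the MPS server.
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=True)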

Issue

Even with MPS enabled and 2 inference pipelines running in 2 separate processes, inference (which includes object detection + tracking + a bunch of other models) runs as expected in one process, but the other process shows considerable lag and frame drops.

Note

A single inference process running normally barely takes 7% of GPU memory.
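(For what it's worth, that figure can be sanity-checked from inside the process with PyTorch's memory counters; these count only tensor allocations, so nvidia-smi will report somewhat more per process because it also includes the CUDA context overhead:)

import torch

# Run after the model is loaded and at least one inference has completed.
print(f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB currently allocated by tensors")
print(f"{torch.cuda.max_memory_allocated() / 2**20:.1f} MiB peak tensor allocation so far")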

Questions

  1. A noob question, but since PyTorch works out of the box with MPS (issue), what is the reason behind the inferior performance of one of the two processes?

  2. Would NVIDIA vGPUs be better suited for my use case? Are they able to run workloads concurrently without interruption?

Very likely the recommendation from NVIDIA would be to use Triton Inference Server for this use case.

Hey Robert, I'm really grateful for your response; it definitely gave me a lot of new ideas about what I might be doing wrong. While I understand Triton would be the ideal solution, we are currently running into issues with converting our models to TorchScript (JIT).
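(For context, the conversion step in question is roughly the following; the resnet18 stand-in and the input shape are placeholders for our actual detection/tracking models:)

import torch
from torchvision.models import resnet18  # placeholder for one of our actual models

model = resnet18().eval()
example = torch.randn(1, 3, 224, 224)  # placeholder input shape

# torch.jit.trace (or torch.jit.script) produces the TorchScript file that
# Triton's PyTorch/LibTorch backend loads; this is the step our models currently fail on.
traced = torch.jit.trace(model, example)
traced.save("model.pt")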

Having said that, I have a specific question about using MPS for this use case.

Let's say we have Python code that goes like this:

# some code
while True:
    _, img = cap.read()
    # preprocess on CPU
    output = model.inference(img.cuda())  # model already loaded on the GPU
    # postprocess on CPU

Assuming that this Python code is running concurrently in 2 separate processes and MPS is enabled, if model.inference is called in each process at the same time, will the inference run concurrently on the same GPU?
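For concreteness, the two processes are launched roughly along these lines (a minimal sketch; the resnet18 stand-in, the inline preprocessing, and the video paths are placeholders for our actual pipelines):

import cv2
import torch
import torch.multiprocessing as mp
from torchvision.models import resnet18  # stand-in for the real detection/tracking models

def run_pipeline(video_path):
    model = resnet18().cuda().eval()  # placeholder: any model already loaded on the GPU
    cap = cv2.VideoCapture(video_path)
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # preprocess on CPU: BGR uint8 frame -> float NCHW tensor
            img = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
            output = model(img.cuda())  # inference on the GPU
            # postprocess on CPU ...

if __name__ == "__main__":
    mp.set_start_method("spawn")  # each child process gets its own CUDA context
    procs = [mp.Process(target=run_pipeline, args=(path,))
             for path in ("camera_a.mp4", "camera_b.mp4")]  # placeholder inputs
    for p in procs:
        p.start()
    for p in procs:
        p.join()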

Yes, and no, probably. A crisp clear answer would require a lot more information, essentially to the point of saying “profile the code and discover the answer yourself”.

At a high level, yes, they would run concurrently, as that is what MPS enables. Work is submitted to the GPU as if it emanated from a single process. There shouldn’t be any exceptional delays in switching from work issued from one process to work issued from another process.

But work on a GPU, to a large extent, means running CUDA kernels. And many people, when they use the word concurrent, often mean “are kernels running concurrently”, i.e. at the same instant in time. Theoretically, MPS allows for the possibility of kernel concurrency from separate processes, but it does not guarantee it. It is a necessary but not sufficient condition.

Whether or not kernels will actually run concurrently is not answerable from what you have shown. In a nutshell, a single inference request (i.e. model.inference()) is going to run at least 1 and probably multiple kernels in sequence on the GPU. The details here will vary depending on what you are using as the inference back-end: whether it is an “ordinary” framework like TF, or TF/TRT, or TRT directly, or Triton. All 4 of those cases, for effectively the same inference request, may look different “under the hood”, i.e. from the view of a profiler, with respect to which kernels are running, and when.

Once we have multiple of these requests, it's possible that we witness no kernel concurrency, for at least a couple of reasons that I can think of:

  1. kernels may not be launched at precisely the same time, even though the “inference is called in each process at the same time”. A single inference request, as already discussed, may require multiple kernel launches to perform. These kernel launches may not be back-to-back, resulting in gaps in between. Depending on the exact launch pattern between two or more separate requests, you may not witness (kernel) concurrency.

  2. If a specific kernel is large enough to “fill” the T4 GPU, you may not witness kernel concurrency. There simply is no “room” on the GPU for another kernel to run at the same time. The sizes of the kernel launches (number of blocks, threads, etc.) wouldn’t be evident unless you did a large amount of code study, including for libraries like cuDNN and TRT which are not open-source, and you had a specific example (e.g. resnet50). But no one in their right mind is going to approach the problem that way (except maybe library designers). The rational approach is to let the profiler give you all this information.
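As a starting point, even PyTorch's built-in profiler will show you which CUDA kernels a single inference call launches and how long each one takes (a minimal sketch with a resnet18 stand-in; for launch dimensions, occupancy, and cross-process timelines you would move to Nsight Systems/Nsight Compute):

import torch
from torch.profiler import profile, ProfilerActivity
from torchvision.models import resnet18  # stand-in for the real inference model

model = resnet18().cuda().eval()
img = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed frame

with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        output = model(img.cuda())

# One row per op/kernel with CUDA time: this shows how many kernels one "inference" really is.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=25))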

The GPU is a throughput machine, and although latency is important when doing DL inference, I would start by addressing the throughput side first. Triton can help there: it makes efficient use of the GPU in the presence of multiple requests by scheduling or batching those requests efficiently. Furthermore, it doesn’t require MPS to work, and all these questions around coarse process-level concurrency are not an issue at the point of inference.
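To make the throughput point concrete, batching several pending frames into one request, which is roughly what Triton's dynamic batcher automates across clients, generally beats issuing them one at a time (a minimal sketch with a resnet18 stand-in; the actual gain depends on the model and on how busy the T4 already is):

import torch
from torchvision.models import resnet18  # stand-in for the real model

model = resnet18().cuda().eval()
frames = [torch.randn(3, 224, 224) for _ in range(8)]  # frames pending from several streams

with torch.no_grad():
    # One request per frame: 8 rounds of small kernel launches, GPU often underutilized.
    outputs_single = [model(f.unsqueeze(0).cuda()) for f in frames]

    # One batched request: larger launches per layer, better throughput on the same GPU.
    batch = torch.stack(frames).cuda()  # shape (8, 3, 224, 224)
    outputs_batched = model(batch)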

Hey Robert, thank you for taking the time to explain in such great detail! I couldn't be more grateful!