Yes and no, probably. A crisp, clear answer would require a lot more information, essentially to the point of saying “profile the code and discover the answer yourself”.
At a high level, yes, they would run concurrently, as that is what MPS enables. Work is submitted to the GPU as if it emanated from a single process. There shouldn’t be any exceptional delays when switching from work issued by one process to work issued by another.
But work on a GPU, to a large extent, means running CUDA kernels. And many people, when they use the word “concurrent”, mean “are kernels running concurrently”, i.e. at the same instant in time. Theoretically, MPS allows for the possibility of kernel concurrency from separate processes, but it does not guarantee it. MPS is a necessary but not a sufficient condition for that kind of concurrency.
Whether or not kernels will actually run concurrently is not answerable from what you have shown. In a nutshell, a single inference request (i.e. model.inference()) is going to run at least one, and probably multiple, kernels in sequence on the GPU. The details here will vary depending on what you are using as an inference back end: an “ordinary” framework like TF, or TF/TRT, or TRT directly, or Triton. All four of those cases, for effectively the same inference request, may look different “under the hood”, i.e. from the view of a profiler, with respect to which kernels are running, and when.
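As a toy illustration of the “one request, several kernels in sequence” point (the kernel names and timings below are entirely made up, not taken from any real back end):

```python
# Toy model: a single model.inference() call dispatches several kernels
# one after another on the same stream. Names and durations are invented;
# a profiler shows the real kernels, which differ by back end (TF, TRT, ...).

request = [("conv1", 0.8), ("relu1", 0.1), ("conv2", 0.9), ("softmax", 0.2)]  # (kernel, ms)

def total_gpu_ms(kernels):
    """End-to-end GPU time if the kernels run strictly in sequence."""
    return sum(d for _, d in kernels)

t = 0.0
for name, dur in request:
    print(f"t={t:4.1f} ms  launch {name} ({dur} ms)")
    t += dur
print(f"one request ~ {total_gpu_ms(request):.1f} ms of sequential kernel time")
```

The point is only that a “single” request is really a sequence of GPU launches, each with its own start time and duration.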
Once we have multiple of these requests, it’s possible that we witness no kernel concurrency, for at least a couple of reasons that I can think of:
Kernels may not be launched at precisely the same time, even though inference is called in each process at the same time. A single inference request, as already discussed, may require multiple kernel launches. These kernel launches may not be back-to-back, leaving gaps in between. Depending on the exact launch pattern of two or more separate requests, you may not witness (kernel) concurrency.
If a specific kernel is large enough to “fill” the T4 GPU, you may not witness kernel concurrency: there simply is no “room” on the GPU for another kernel to run at the same time. The sizes of the kernel launches (number of blocks, threads, etc.) wouldn’t be evident unless you did a large amount of code study, including of libraries like cuDNN and TRT, which are not open source, and you had a specific example (e.g. resnet50) in hand. But no one in their right mind is going to approach the problem that way (except maybe library designers). The rational approach is to let the profiler give you all this information.
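To make the first reason concrete, here is a toy timeline model (all timings invented; only a profiler such as Nsight Systems shows the real launch pattern): two processes each launch a sequence of kernels with gaps, and kernels from the two processes coincide only if their execution intervals actually overlap.

```python
# Toy timeline model: each kernel occupies a half-open interval
# [start, end) on the timeline; the numbers are made up for illustration.

def kernel_intervals(start, durations, gap):
    """Intervals for kernels launched in sequence with a fixed gap between them."""
    out, t = [], start
    for d in durations:
        out.append((t, t + d))
        t += d + gap
    return out

def any_overlap(a, b):
    """Do any two kernels, one from each process, run at the same instant?"""
    return any(s1 < e2 and s2 < e1 for (s1, e1) in a for (s2, e2) in b)

proc_a = kernel_intervals(start=0.0, durations=[1.0, 1.0], gap=1.0)  # [0,1), [2,3)
proc_b = kernel_intervals(start=1.0, durations=[1.0, 1.0], gap=1.0)  # [1,2), [3,4)

# Both processes are "busy" over the same window, yet their kernels
# interleave perfectly and never run at the same instant:
print(any_overlap(proc_a, proc_b))   # False

# Shift process B's start by half a kernel and they do overlap:
proc_b_shifted = kernel_intervals(start=0.5, durations=[1.0, 1.0], gap=1.0)
print(any_overlap(proc_a, proc_b_shifted))   # True
```

So “both processes issued work at the same time” and “kernels ran concurrently” are two different claims; only the profiler distinguishes them.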
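And a back-of-the-envelope sketch of the second reason. The SM count and resident-thread limit below are the published figures for the T4 (compute capability 7.5), but treat them as illustrative: real occupancy also depends on registers and shared memory per block, which is exactly why the profiler (or the CUDA occupancy API) is the right tool.

```python
# Rough check: does a kernel launch "fill" the GPU, leaving no room for a
# concurrent kernel from another process? Illustrative limits only; real
# residency also depends on register and shared-memory usage.

SM_COUNT = 40              # T4 has 40 SMs
MAX_THREADS_PER_SM = 1024  # resident-thread limit on compute capability 7.5

def gpu_is_full(num_blocks, threads_per_block):
    """Crude estimate: does the launch occupy every resident-block slot?"""
    blocks_per_sm = MAX_THREADS_PER_SM // threads_per_block
    resident_capacity = SM_COUNT * blocks_per_sm
    return num_blocks >= resident_capacity

# A launch of 40 blocks x 1024 threads occupies every SM completely:
print(gpu_is_full(40, 1024))   # True: no room for another kernel
# A launch of 8 blocks x 256 threads leaves plenty of headroom:
print(gpu_is_full(8, 256))     # False: concurrency is at least possible
```

Even this crude model shows why the answer is launch-dependent: the same GPU can be “full” or mostly idle depending on the grid dimensions the library happened to choose.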
The GPU is a throughput machine, and although latency is important when doing DL inference, I would start by addressing the throughput side first. Triton can help there: it makes efficient use of the GPU in the presence of multiple requests by efficiently scheduling or batching them. Furthermore, it doesn’t require MPS to work, and all these questions around coarse, process-level concurrency cease to be an issue at the point of inference.
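To illustrate the batching idea only (this is a sketch of the concept, not Triton’s actual API or scheduling policy; `max_batch` and `window` are made-up parameters standing in for a server’s batching configuration):

```python
# Concept sketch of dynamic batching: requests that arrive close together
# are merged into one batched GPU call, trading a little latency for
# throughput. Not Triton's API; just the idea.

from collections import deque

def form_batches(arrivals, max_batch=4, window=5.0):
    """Greedily batch request arrival times: a batch closes when it is full
    or when the next request falls outside the batching window."""
    batches, pending = [], deque(sorted(arrivals))
    while pending:
        batch = [pending.popleft()]
        while pending and len(batch) < max_batch and pending[0] - batch[0] <= window:
            batch.append(pending.popleft())
        batches.append(batch)
    return batches

# Eight requests arriving in two bursts become two batched GPU calls
# instead of eight individual ones:
arrivals = [0, 1, 2, 3, 20, 21, 22, 23]
print(form_batches(arrivals))   # [[0, 1, 2, 3], [20, 21, 22, 23]]
```

One batched call of 4 keeps the GPU far busier than 4 small calls competing for the same SMs, which is why batching is usually the first lever to pull before reaching for process-level concurrency tricks.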