Unable to Run Parallel Inference on Two GPUs Using Python (Multi-Model, Multi-Queue Setup)

Hello,

I am trying to run parallel inference on two GPUs in a Python application, but I am not able to utilize both GPUs simultaneously.

System Details

  • GPUs: NVIDIA RTX 5080 (x2)

  • OS: Windows 11

  • Framework: PyTorch 2.x + CUDA 12.4

  • Models: 6 YOLO models (custom trained)

  • Python: 3.10

My Setup

I have 6 models total:

  • GPU 0: Model A, B, C

  • GPU 1: Model D, E, F

Images arrive from my real-time inspection system, and each image's filename indicates which model should process it, e.g.:

body_21112025.jpg → goes to body model on GPU 0
neck_21112025.jpg → goes to neck model on GPU 1

Current Behavior

Only one GPU is being utilized at a time.

Example scenario:

  • If an image for GPU0 arrives, GPU1 stays idle.

  • If an image for GPU1 arrives, GPU0 stays idle.

  • Even when I receive images for both GPUs at the same moment, one GPU waits for the other to finish.

So the workload alternates between the two GPUs instead of running truly in parallel.

What I Want

I want to pull two images from a queue (e.g., one for GPU0, one for GPU1)
and run inference on both GPUs at the same time.

Example desired behavior:

  • Image1 → GPU0 model executes

  • Image2 → GPU1 model executes

  • Both should run simultaneously with full utilization.

What I Have Tried

  • Using Python threads

  • Using concurrent.futures.ThreadPoolExecutor

  • Using multiprocessing

  • Setting torch.cuda.set_device()

  • Manually assigning each model to a specific GPU

  • Moving each model explicitly with .to("cuda:0") or .to("cuda:1")

But no matter what I do, inference ends up serialized instead of running in parallel.
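For reference, here is a stripped-down version of my thread-based attempt. The YOLO inference itself is replaced by a placeholder function, and the routing table is simplified to two prefixes, so this only shows the dispatch pattern, not the real models:

```python
import queue
import threading

# One queue per GPU; the filename prefix decides the route.
# run_inference is a stub standing in for the real YOLO call.
gpu_queues = {0: queue.Queue(), 1: queue.Queue()}
PREFIX_TO_GPU = {"body": 0, "neck": 1}  # simplified routing table

def run_inference(gpu_id, image_name):
    # Placeholder for the real per-GPU model call.
    return f"gpu{gpu_id}:{image_name}"

results = []
results_lock = threading.Lock()

def worker(gpu_id):
    while True:
        image_name = gpu_queues[gpu_id].get()
        if image_name is None:  # poison pill -> shut down
            break
        out = run_inference(gpu_id, image_name)
        with results_lock:
            results.append(out)

threads = [threading.Thread(target=worker, args=(g,)) for g in gpu_queues]
for t in threads:
    t.start()

# Dispatch one image per GPU, as in my real stream.
for name in ["body_21112025.jpg", "neck_21112025.jpg"]:
    prefix = name.split("_")[0]
    gpu_queues[PREFIX_TO_GPU[prefix]].put(name)

for q in gpu_queues.values():
    q.put(None)
for t in threads:
    t.join()

print(sorted(results))
```

With the stub both workers overlap fine; it is only with the real PyTorch inference that the two GPUs end up taking turns.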

My Question

What is the correct way in Python / PyTorch to:

  1. Run inference on two GPUs in true parallel,

  2. While using different models on each GPU,

  3. And processing image streams arriving at the same time?

Do I need:

  • Separate CUDA contexts?

  • Separate Python processes per GPU?

  • Any special PyTorch configuration?

  • A different strategy for queue handling?

Any advice, examples, or best practices for multi-GPU parallel inference would be extremely helpful.

Thank you.

Hi there @kongondamallesh, welcome to the NVIDIA developer forums.

I think this question might be better answered by our CUDA community, so for now I will move this post over there.

There was also a live stream not too long ago about this topic, but I am not sure if it is easily applicable to your use-case.

Thanks!

It would be easier to use Triton Inference Server: load the models on different GPUs and then perform async inference.
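As a rough sketch (the model name, platform, and batch size here are placeholders, not tuned for your models), each entry in the Triton model repository can pin its instances to one GPU via instance_group in its config.pbtxt:

```
# config.pbtxt for one of the six models (names are placeholders)
name: "body_yolo"
platform: "onnxruntime_onnx"
max_batch_size: 8
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]   # pin this model's instances to GPU 0
  }
]
```

With three models pinned to GPU 0 and three to GPU 1 this way, concurrent client requests (e.g. via tritonclient's async_infer) run on both GPUs at once, and the server handles the per-GPU scheduling for you.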

Thank you @MarkusHoHo for your suggestions.

I would like to know if support is available in any of the following forms:

  • Personal (one-to-one) support

  • Online support

  • Offline (in-person) support

I am open to any of these options and would appreciate your guidance on which one would be most suitable or currently available.

Thank you very much for your time and support. I look forward to your response.

Kind regards,

Mallesh kongonda