Hello,
I am trying to run parallel inference on two GPUs in a Python application, but I am not able to utilize both GPUs simultaneously.
System Details
- GPUs: NVIDIA RTX 5080 (x2)
- OS: Windows 11
- Framework: PyTorch 2.x + CUDA 12.4
- Models: 6 YOLO models (custom trained)
- Python: 3.10
My Setup
I have 6 models total:
- GPU 0: Models A, B, C
- GPU 1: Models D, E, F
Images arrive in real time from my inspection system.
Each image filename indicates which model should process it, e.g.:
body_21112025.jpg → goes to body model on GPU 0
neck_21112025.jpg → goes to neck model on GPU 1
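To make the routing concrete, here is a minimal sketch of how I map a filename prefix to a model and device (the table `MODEL_TO_DEVICE` and the function `route` are simplified placeholders, not my real code):

```python
# Hypothetical routing table: filename prefix -> (model name, device).
# The prefixes and devices mirror the naming scheme above.
MODEL_TO_DEVICE = {
    "body": ("body_model", "cuda:0"),
    "neck": ("neck_model", "cuda:1"),
}

def route(filename: str):
    """Return (model name, device) for an image based on its filename prefix."""
    prefix = filename.split("_", 1)[0]
    return MODEL_TO_DEVICE[prefix]

print(route("body_21112025.jpg"))  # -> ("body_model", "cuda:0")
print(route("neck_21112025.jpg"))  # -> ("neck_model", "cuda:1")
```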
Current Behavior
Only one GPU is being utilized at a time.
Example scenario:
- If an image for GPU 0 arrives, GPU 1 stays idle.
- If an image for GPU 1 arrives, GPU 0 stays idle.
- Even when images for both GPUs arrive at the same moment, one GPU waits for the other to finish.
So the workload alternates between GPUs instead of running in true parallel.
What I Want
I want to pull two images from a queue (e.g., one for GPU0, one for GPU1)
and run inference on both GPUs at the same time.
Example desired behavior:
- Image1 → GPU 0 model executes
- Image2 → GPU 1 model executes
- Both run simultaneously with full utilization.
What I Have Tried
- Using Python threads
- Using concurrent.futures.ThreadPoolExecutor
- Using multiprocessing
- Setting torch.cuda.set_device()
- Manually assigning each model to a specific GPU
- Ensuring all models are moved with .to("cuda:0") or .to("cuda:1")
But no matter what I do, the inference becomes serialized instead of parallel.
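For reference, here is a stripped-down sketch of the thread-per-GPU queue pattern I tried. `infer` is a stand-in for the actual YOLO forward pass (the real code holds the models pinned to each device); everything else (queue names, sentinel shutdown) is simplified for illustration:

```python
import queue
import threading

# One queue per GPU; the dispatcher puts each image on the queue
# for the device its filename routes to.
queues = {"cuda:0": queue.Queue(), "cuda:1": queue.Queue()}
results = []
lock = threading.Lock()

def infer(image, device):
    # Stand-in for the real YOLO model call on `device`.
    return f"{image} processed on {device}"

def worker(device):
    # Each worker thread serves exactly one GPU's queue.
    while True:
        image = queues[device].get()
        if image is None:  # sentinel: shut this worker down
            break
        out = infer(image, device)
        with lock:
            results.append(out)

threads = [threading.Thread(target=worker, args=(d,)) for d in queues]
for t in threads:
    t.start()

# Two images arriving "at the same time", one per GPU.
queues["cuda:0"].put("body_21112025.jpg")
queues["cuda:1"].put("neck_21112025.jpg")
for q in queues.values():
    q.put(None)  # stop the workers
for t in threads:
    t.join()

print(sorted(results))
```

Even with this structure, the two GPUs never seem to execute at the same moment in my real application.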
My Question
What is the correct way in Python / PyTorch to:
- Run inference on two GPUs in true parallel,
- while using different models on each GPU,
- and processing image streams that arrive at the same time?
Do I need:
- Separate CUDA contexts?
- Separate Python processes per GPU?
- Any special PyTorch configuration?
- A different strategy for queue handling?
Any advice, examples, or best practices for multi-GPU parallel inference would be extremely helpful.
Thank you.