Unable to Run Parallel Inference on Two GPUs Using Python (Multi-Model, Multi-Queue Setup)

Hello,

I am trying to run parallel inference on two GPUs in a Python application, but I am not able to utilize both GPUs simultaneously.

System Details

  • GPUs: NVIDIA RTX 5080 (x2)

  • OS: Windows 11

  • Framework: PyTorch 2.x + CUDA 12.4

  • Models: 6 YOLO models (custom trained)

  • Python: 3.10

My Setup

I have 6 models total:

  • GPU 0: Model A, B, C

  • GPU 1: Model D, E, F

Images arrive from my real-time inspection system, and each image's filename indicates which model should process it, e.g.:

body_21112025.jpg → goes to body model on GPU 0
neck_21112025.jpg → goes to neck model on GPU 1

Current Behavior

Only one GPU is being utilized at a time.

Example scenario:

  • If an image for GPU0 arrives, GPU1 stays idle.

  • If an image for GPU1 arrives, GPU0 stays idle.

  • Even when I receive images for both GPUs at the same moment, one GPU waits for the other to finish.

So the workload alternates between the two GPUs instead of running truly in parallel.

What I Want

I want to pull two images from a queue (e.g., one for GPU0, one for GPU1)
and run inference on both GPUs at the same time.

Example desired behavior:

  • Image1 → GPU0 model executes

  • Image2 → GPU1 model executes

  • Both should run simultaneously with full utilization.

What I Have Tried

  • Using Python threads

  • Using concurrent.futures.ThreadPoolExecutor

  • Using multiprocessing

  • Setting torch.cuda.set_device()

  • Manually assigning each model to a specific GPU

  • Moving each model explicitly with .to("cuda:0") or .to("cuda:1")

But no matter what I do, inference ends up serialized instead of running in parallel.
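For reference, here is a stripped-down version of my thread-based attempt. The YOLO inference itself is replaced by a placeholder function, and the routing table is simplified to two prefixes, so this only shows the dispatch pattern, not the real models:

```python
import queue
import threading

# One queue per GPU; the filename prefix decides the route.
# run_inference is a stub standing in for the real YOLO call.
gpu_queues = {0: queue.Queue(), 1: queue.Queue()}
PREFIX_TO_GPU = {"body": 0, "neck": 1}  # simplified routing table

def run_inference(gpu_id, image_name):
    # Placeholder for the real per-GPU model call.
    return f"gpu{gpu_id}:{image_name}"

results = []
results_lock = threading.Lock()

def worker(gpu_id):
    while True:
        image_name = gpu_queues[gpu_id].get()
        if image_name is None:  # poison pill -> shut down
            break
        out = run_inference(gpu_id, image_name)
        with results_lock:
            results.append(out)

threads = [threading.Thread(target=worker, args=(g,)) for g in gpu_queues]
for t in threads:
    t.start()

# Dispatch one image per GPU, as in my real stream.
for name in ["body_21112025.jpg", "neck_21112025.jpg"]:
    prefix = name.split("_")[0]
    gpu_queues[PREFIX_TO_GPU[prefix]].put(name)

for q in gpu_queues.values():
    q.put(None)
for t in threads:
    t.join()

print(sorted(results))
```

With the stub both workers overlap fine; it is only with the real PyTorch inference that the two GPUs end up taking turns.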

My Question

What is the correct way in Python / PyTorch to:

  1. Run inference on two GPUs in true parallel,

  2. While using different models on each GPU,

  3. And processing image streams arriving at the same time?

Do I need:

  • Separate CUDA contexts?

  • Separate Python processes per GPU?

  • Any special PyTorch configuration?

  • A different strategy for queue handling?

Any advice, examples, or best practices for multi-GPU parallel inference would be extremely helpful.

Thank you.

Hi there @kongondamallesh, welcome to the NVIDIA developer forums.

I think this question might be better answered by our CUDA community, so for now I will move this post over there.

There was also a live stream not too long ago about this topic, but I am not sure if it is easily applicable to your use-case.

Thanks!

It would be easier to use Triton Inference Server: load the models on different GPUs and then perform async inference.
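As a rough sketch (the model name, platform, and batch size here are placeholders, not tuned for your models), each entry in the Triton model repository can pin its instances to one GPU via instance_group in its config.pbtxt:

```
# config.pbtxt for one of the six models (names are placeholders)
name: "body_yolo"
platform: "onnxruntime_onnx"
max_batch_size: 8
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]   # pin this model's instances to GPU 0
  }
]
```

With three models pinned to GPU 0 and three to GPU 1 this way, concurrent client requests (e.g. via tritonclient's async_infer) run on both GPUs at once, and the server handles the per-GPU scheduling for you.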

Thank you @MarkusHoHo for your suggestions.

I would like to know if support is available in any of the following forms:

  • Personal (one-to-one) support

  • Online support

  • Offline (in-person) support

I am open to any of these options and would appreciate your guidance on which one would be most suitable or currently available.

Thank you very much for your time and support. I look forward to your response.

Kind regards,

Mallesh kongonda