Triton Inference Server: inference on multiple GPUs and load balancing across GPUs

Inference with a TensorRT model on multiple GPUs works as expected as long as the GPUs belong to the same GPU family. Triton loads the same model on all of the GPUs when gpus: [0, 1] is specified in instance_group (config.pbtxt), and a single infer API call is enough; Triton handles the load balancing across the GPUs.
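For reference, a minimal config.pbtxt of that kind looks roughly like the sketch below (the model name and max_batch_size are placeholders for this example):

    name: "my_trt_model"
    platform: "tensorrt_plan"
    max_batch_size: 8
    instance_group [
      {
        # One model instance on each of GPU 0 and GPU 1;
        # Triton schedules incoming requests across the instances.
        count: 1
        kind: KIND_GPU
        gpus: [ 0, 1 ]
      }
    ]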

How should inference with a TensorRT model be handled on multiple GPUs when the GPUs belong to different families, and how can we achieve the same kind of optimized load balancing as above?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)

• DeepStream Version

• JetPack Version (valid for Jetson only)

• TensorRT Version

• NVIDIA GPU Driver Version (valid for GPU only)

• Issue Type (questions, new requirements, bugs)

• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)

• Requirement details (This is for a new requirement. Include the module name, i.e. which plugin or which sample application, and the function description.)

@fanzh, Thank you for your response. My query is more general and not specific to a particular setup or issue. I am exploring how to optimize inference load balancing using TensorRT across GPUs of different families (e.g., combining an Ampere GPU with a Pascal GPU) in a multi-GPU setup.

To clarify:

This is not related to a specific hardware platform or software version but is a conceptual question.
I would like to understand the best practices or general approach for:
    Deploying a TensorRT model across GPUs of different families.
    Ensuring optimized load balancing (similar to how it's done when GPUs belong to the same family).

If such configurations are not natively supported, I would also appreciate insights or suggestions for alternative strategies.
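For concreteness, one direction I am considering, though I have not verified it, is building one engine per GPU architecture and pointing Triton at them via the cc_model_filenames setting in config.pbtxt; the file names and compute capabilities below are illustrative, assuming GPU 0 is Ampere and GPU 1 is Pascal:

    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 0, 1 ]
      }
    ]
    # Hypothetical mapping: compute capability 8.6 (Ampere) and 6.1 (Pascal)
    # each load an engine built for that architecture.
    default_model_filename: "model_ampere.plan"
    cc_model_filenames [
      {
        key: "8.6"
        value: "model_ampere.plan"
      },
      {
        key: "6.1"
        value: "model_pascal.plan"
      }
    ]

If that does not work for mixed families, I assume the fallback would be two separate models (one engine per family) with traffic split at the client or by an upstream load balancer, but I would appreciate confirmation or a better approach.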

Thank you for your assistance!

This is the DeepStream forum, and Triton questions are outside of DeepStream's scope. I suggest asking in the Triton forum. Thanks!