Triton-server model load balancing

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) T4, RTX4000
• DeepStream Version 6.1.1
• JetPack Version (valid for Jetson only)
• TensorRT Version 8.1.2
• NVIDIA GPU Driver Version (valid for GPU only) 515.65.01
• Issue Type( questions, new requirements, bugs) questions
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)


I am using Triton Inference Server on a single-GPU server, and I cannot fully utilise my GPU because one of my models is slow.

Is there a way to scale an individual model in Triton server?

one of model is slow ==> Do you run it with TensorRT? How did you determine that it is slow?

scale individual model in triton-server ==> Sorry, what do you mean by ‘scale’?

It is a Python pre-processing model. I want to run multiple instances of the same model. If that is supported, does Triton server manage the number of requests sent to each instance?

Hi @dilip.patel

does Triton server manage the number of requests sent to each instance?

Yes, it’s supported. You can also use batch inference mode for this.
What backend will you use for Triton inference?
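
For reference, running multiple instances of one model is requested through the `instance_group` setting in the model’s `config.pbtxt`; Triton’s scheduler then distributes incoming requests across those instances automatically. A minimal sketch, with illustrative model and tensor names that are not from this thread:

```
# config.pbtxt sketch -- names, shapes, and counts are illustrative
name: "preprocess"
backend: "python"
max_batch_size: 8

input [
  { name: "INPUT0", data_type: TYPE_FP32, dims: [ -1 ] }
]
output [
  { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ -1 ] }
]

# Two copies of the model on GPU 0; Triton schedules requests
# across the instances so the caller does not have to.
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
]
```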

I will use the Python backend.
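
For context, a Python-backend model is a `model.py` file that defines a `TritonPythonModel` class with `initialize`, `execute`, and `finalize` methods. Below is a minimal sketch of that interface (tensor names are illustrative); the `triton_python_backend_utils` module only exists inside the Triton container, so the import is guarded here so the file can also be inspected outside it:

```python
import json

# Available only when running inside the Triton Python backend;
# guarded so this sketch can be read/tested outside the container.
try:
    import triton_python_backend_utils as pb_utils
except ImportError:
    pb_utils = None


class TritonPythonModel:
    """Minimal Triton Python-backend model sketch."""

    def initialize(self, args):
        # args["model_config"] is the model's config.pbtxt as a JSON string.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # Triton may hand this instance a batch of requests at once.
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            data = in_tensor.as_numpy()
            # ... pre-processing work on `data` would go here ...
            out_tensor = pb_utils.Tensor("OUTPUT0", data)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses

    def finalize(self):
        # Called once when the model instance is unloaded.
        pass
```

With `instance_group { count: N }` in the model configuration, Triton creates N of these model instances and distributes requests among them.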

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

The PyTorch backend supports dynamic batching, so you can build the model with a maximum batch size and send requests to the Triton server/PyTorch backend with any batch size up to that maximum.
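
As a sketch of what that configuration looks like for a TorchScript model served by the PyTorch backend (model name and sizes are illustrative, not taken from this thread):

```
# config.pbtxt sketch for the PyTorch backend
name: "my_pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 8    # clients may send any batch size <= 8

# Let Triton combine individual requests into larger batches.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```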

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.