AGX Orin - Triton inference

Hello,

I’d like to understand more about how Triton gRPC works. I’ve seen heavy network bandwidth usage (around 3 Gbps; I’m using a 10 Gbps switch) when running inference for a segmentation model on a 1080p, 30 FPS input video stream.
I’m trying to make inference on another Orin run as smoothly as possible.

On a single Orin, I get around 50-55 FPS. When running the inference on another Orin (DeepStream pipeline on Orin 1; Triton server on Orin 2), I get about 20-25 FPS.
I’m trying to increase that number, so I need to understand more about gRPC and where the bottlenecks are.

I’d like to understand the content of the gRPC exchanges between a Triton client and the server. I’ve noticed quite a large bandwidth usage and I’d like to know how to reduce and optimize it.

In gRPC mode, DeepStream sends the tensors to the Triton server for inference and receives the inference results over the gRPC protocol. Please refer to the DeepStream doc and the Triton doc.
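To get a feel for why uncompressed tensors can consume this much bandwidth, here is a rough back-of-the-envelope sketch. The tensor shape and data type are assumptions for illustration only (the model's actual input shape is not given in this thread), the function names are made up, and the calculation ignores gRPC/protobuf framing overhead.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element):
    """Size in bytes of one dense, uncompressed tensor."""
    return prod(shape) * bytes_per_element

def stream_gbps(bytes_per_frame, fps):
    """Payload bandwidth in Gbps (10^9 bits/s) when sending this many bytes per frame."""
    return bytes_per_frame * fps * 8 / 1e9

def bytes_per_frame_from_gbps(gbps, fps):
    """Inverse estimate: bytes moved per frame for an observed bandwidth and frame rate."""
    return gbps * 1e9 / 8 / fps

# Hypothetical input tensor: 3 x 1080 x 1920, FP32 (4 bytes per element).
full_hd_fp32 = tensor_bytes((3, 1080, 1920), 4)
print(full_hd_fp32)                     # bytes per frame for the assumed shape
print(stream_gbps(full_hd_fp32, 25))    # Gbps at 25 FPS for the assumed shape

# Working backward from the figures in this thread (about 3 Gbps at 20-25 FPS).
print(bytes_per_frame_from_gbps(3.0, 25))
print(bytes_per_frame_from_gbps(3.0, 20))
```

Under these simplifications, about 3 Gbps at 20-25 FPS corresponds to roughly 15-19 MB of tensor traffic per frame, inputs and outputs combined.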

Hello fanzh,

I would like to emphasize that I’m in a very particular case where I want to perform inference on another Orin.
I need to optimize the bandwidth (+ … ?) to be able to increase the inference speed.

What are the configuration differences between the two tests (50-55 FPS and 20-25 FPS)? Is the 50-55 FPS test using gRPC mode?

  • Test 1 : Deepstream + Triton Inference on a single Orin: 50-55FPS
  • Test 2 : Deepstream (Orin1) + Triton Inference (Orin 2): 20-25FPS

In both cases, I’m using gRPC with the Triton server.
My DeepStream config is the same in both. Yes, I’m using “enable_cuda_buffer_sharing=true”.

  1. Are you using custom code? If not, which DeepStream sample are you testing? Could you share the configuration file and the complete logs from both tests? I’d also like to know the source type and sink type.
  2. Please make sure the two Orins have the same settings. Please refer to this topic.
  • Are you using custom code? If not, which DeepStream sample are you testing? Could you share the configuration file and the complete logs from both tests? I’d also like to know the source type and sink type.

Yes, I’m using custom code, but it should be similar to the DeepStream samples.
The project is under NDA, so I’m not able to share it here in public. Is there another way to share it?

  • Please make sure the two Orins have the same settings. Please refer to this topic.

Yes, they are using the same settings.

  1. “enable_cuda_buffer_sharing=true” is not an acceleration feature on Jetson. Please refer to the doc. So the only difference between the two tests is the network transmission.
  2. Are you using a local file or an RTSP stream? Why do you need to deploy in gRPC mode across two machines? The tensors are sent uncompressed. The Triton inference API AsyncInfer supports compression, but testing showed no FPS improvement.
  3. For a higher FPS, you can use interval to skip inference on some frames if not all frames need to be inferred. Please find “interval” in the doc.
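As a rough illustration of the “interval” suggestion: the setting skips consecutive batches between inferences, so fewer tensors cross the network. The sketch below is not DeepStream code. It assumes a batch size of 1, that the first frame is inferred, and that `interval` frames are then skipped before the next inference; the function names are made up for illustration.

```python
def inferred_frames(total_frames, interval):
    """Indices of frames sent for inference when `interval` frames
    are skipped after each inferred frame (assumed behavior, batch size 1)."""
    return list(range(0, total_frames, interval + 1))

def tensor_traffic_ratio(interval):
    """Fraction of frames whose tensors are sent, relative to interval=0."""
    return 1 / (interval + 1)

print(inferred_frames(6, 0))        # every frame is inferred
print(inferred_frames(6, 1))        # every other frame is inferred
print(tensor_traffic_ratio(1))      # share of the original tensor traffic
```

The trade-off is that skipped frames get no fresh inference result, which may or may not be acceptable for a segmentation use case.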
  • “enable_cuda_buffer_sharing=true” is not an acceleration feature on Jetson. Please refer to the doc. So the only difference between the two tests is the network transmission.

Good to know, but that was not clear from the doc.

  • Are you using a local file or an RTSP stream? Why do you need to deploy in gRPC mode across two machines? The tensors are sent uncompressed. The Triton inference API AsyncInfer supports compression, but testing showed no FPS improvement.

The input stream is a local file.
As for the why, I’d be happy to discuss it outside of this forum.

  • For a higher FPS, you can use interval to skip inference on some frames if not all frames need to be inferred. Please find “interval” in the doc.

I’ll look into this and see if it is applicable.

Sorry for the late reply. Is this still a DeepStream issue that needs support? Thanks!

Hello fanzh,

It’s not a “real issue”; it’s more a request for performance advice and improvements.

Are there any other solutions?
Maybe I’ll rephrase: how is it done on dGPU? Are there any tricks to increase the FPS when the Triton server is not on the same machine as DeepStream?

The DeepStream nvinferserver and Triton code are open source. The main workflow is the same, except for the enable_cuda_buffer_sharing feature on dGPU. Please refer to the nvinferserver doc and the Triton doc. When using gRPC mode on Jetson, tensor transmission is the bottleneck.
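To put a rough number on that bottleneck using the figures reported in this thread, the sketch below converts FPS into per-frame time and compares the two setups. It assumes frames are processed strictly one at a time with no overlap between transfer and inference, which may not hold for the real pipeline. The 15 MB payload is a hypothetical value, and the function names are made up for illustration.

```python
def frame_time_ms(fps):
    """Average time budget per frame, in milliseconds."""
    return 1000.0 / fps

def added_time_ms(local_fps, remote_fps):
    """Extra per-frame time in the remote setup compared with the local one."""
    return frame_time_ms(remote_fps) - frame_time_ms(local_fps)

def wire_time_ms(bytes_per_frame, link_gbps):
    """Ideal time to push one frame's payload over the link, ignoring all overhead."""
    return bytes_per_frame * 8 / (link_gbps * 1e9) * 1000.0

# Reported figures: 50-55 FPS on a single Orin, 20-25 FPS across two Orins.
print(added_time_ms(50, 25))   # smallest gap between the reported ranges
print(added_time_ms(55, 20))   # largest gap between the reported ranges

# Hypothetical 15 MB of tensor data per frame on a 10 Gbps link.
print(wire_time_ms(15_000_000, 10))
```

Under these assumptions the remote setup adds roughly 20-32 ms per frame. The ideal wire time for the hypothetical payload accounts for only part of that, which suggests serialization and copies on both ends may also contribute.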

Is the HTTP method more optimized?
If not, what are my alternatives?

Triton supports the HTTP method, but nvinferserver only supports native and gRPC modes. Please refer to the doc.

OK, then I think I’ll have to change my approach to mitigate this issue.

Thanks for the information.