Deploying Models from TensorFlow Model Zoo Using NVIDIA DeepStream and NVIDIA Triton Inference Server

Originally published at: Deploying Models from TensorFlow Model Zoo Using NVIDIA DeepStream and NVIDIA Triton Inference Server | NVIDIA Technical Blog

If you’re building a unique AI/DL application, you are constantly looking to train and deploy AI models from various frameworks like TensorFlow, PyTorch, TensorRT, and others quickly and effectively. Whether it’s deployment using the cloud, datacenters, or the edge, NVIDIA Triton Inference Server enables developers to deploy trained models from any major framework such as TensorFlow,…

Hi, this is Dhruv. Hope the blog was instructional. Triton Inference Server is something I find myself using very often to deploy models for simple tests as well as production. Being framework agnostic, it’s also really useful for testing off-the-shelf models for latency/performance and accuracy to make sure they’ll meet my needs. With the integration of Triton with DeepStream, these abilities are now available on NVIDIA dGPU and NVIDIA Jetson with streaming video and edge-to-cloud features. While this blog focuses on deepstream-app as a turnkey solution for IVA, the nvinferserver GStreamer plugin can be used for most models. Furthermore, TF-TRT allows for easy performance optimization with minimal time spent creating a TensorRT plan, so you can prototype and see what kind of low-hanging fruit can be used to improve performance. Good luck with your IVA projects!

Hi @dsingalNV, I see in your blog at Deploying Models from TensorFlow Model Zoo Using NVIDIA DeepStream and NVIDIA Triton Inference Server | NVIDIA Technical Blog that the sample was conducted using FasterRCNN-InceptionV2, but the results are shown for FasterRCNN-InceptionV3. Is that correct? If so, what is the expected performance for FasterRCNN-InceptionV2? Currently, for FP32 with no optimization (4 streams and BS=4), I am getting around 4 fps on a T4.

@virsg the results are for FasterRCNN-InceptionV2; thanks for bringing that issue to my attention. Regarding your performance: with 4 streams and BS=4, if you’re getting 4 fps per stream before any optimization, then that is in line with what we expect (our single-stream example obtained 12.8 fps total). On the other hand, if you’re observing 4 fps in total, i.e. 1 fps for each of your 4 streams, then some things to investigate would be:

  1. Make sure you’re using the same inference resolution.
  2. Make sure your config file for Triton sets the resolution properly (see the rough config.pbtxt sketch after this list).
  3. Check whether reducing the batch size increases fps, in case the model is having to wait for 4 frames to be collected before inference is conducted.
  4. Profile your system during the run to see if you’re CPU/IO bound. If not, then use the NVIDIA profiler to check GPU activity during the time spent.
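
To illustrate points 2 and 3, here is roughly what the relevant part of a Triton config.pbtxt looks like for a TF Object Detection API Faster R-CNN model. The tensor names, dims, and batch size below are just the usual OD API defaults and are assumptions on my part, so check them against your exported graph and the sample configs shipped with DeepStream:

```
# Hypothetical config.pbtxt sketch for a TF Object Detection API Faster R-CNN model.
# Tensor names, dims, and batch size are the usual OD API defaults -- verify against your export.
name: "faster_rcnn_inception_v2"
platform: "tensorflow_graphdef"
max_batch_size: 4            # point 3: try lowering this if frames queue up waiting for a full batch
input [
  {
    name: "image_tensor"
    data_type: TYPE_UINT8
    format: FORMAT_NHWC
    dims: [ 600, 1024, 3 ]   # point 2: keep this consistent with the inference resolution in DeepStream
  }
]
output [
  { name: "detection_boxes"   data_type: TYPE_FP32 dims: [ 100, 4 ] },
  { name: "detection_scores"  data_type: TYPE_FP32 dims: [ 100 ] },
  { name: "detection_classes" data_type: TYPE_FP32 dims: [ 100 ] },
  { name: "num_detections"    data_type: TYPE_FP32 dims: [ 1 ] }
]
```

If the model is waiting to collect a full batch, dropping max_batch_size here (and the corresponding batch size on the DeepStream side) is a quick way to confirm point 3.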

Thanks @dsingalNV, I am using the Docker image deepstream:5.0.1-20.09-triton. It seems that my results are OK then? See below:

With num-sources=1 I am getting **PERF: 17.31 (16.85)
With num-sources=4 I am getting **PERF: 4.33 (4.32) 4.33 (4.35) 4.33 (4.32) 4.33 (4.32)

Also some extra questions here:

1- What is the best technique to maximize GPU utilization with DeepStream and Triton? I have increased the instance count to 2, but average GPU utilization was ~60% at 5 fps per stream. I was expecting the GPU utilization and the throughput to double, since there were 2 instances of the model loaded onto the T4.

2- How can I launch Deepstream-Triton server and client separately?

Yes, the performance results you’re seeing are okay.

  1. deepstream-app is just a simple application to portray the use of DeepStream. In order to get the maximum performance, you would have to look at engineering your own video input/inference/post-processing pipeline.
    You would also be better served by using NVIDIA Nsight Systems to see what the GPU is actually doing with the multiple instances of the model: is the inference on the two instances happening in parallel, or is only one instance active at a time? (A rough sketch of how instances are declared on the Triton side follows this list.) In order to get more performance, you can look into translating the model into a TRT-only engine, which gets around quite a lot of overhead from TF-TRT. Currently, only two subgraphs of the model have TRT engines generated for them.
  2. To my knowledge there’s no way to do that in DeepStream, since the DeepStream gst-nvstreammux is passing the frames to gst-nvinferserver, thus acting as the client. In case you’re trying to do inference on a cloud-hosted instance of DeepStream, you could look into the [source] field that supports streamed input for inference.
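
For reference, on the Triton side the two model instances are declared in the model’s config.pbtxt via instance_group. The sketch below is only illustrative: the counts and dynamic_batching values are assumptions, and whether dynamic batching actually helps depends on how nvinferserver submits requests in your pipeline:

```
# Illustrative only -- goes in the model's config.pbtxt; counts and values are assumptions.
instance_group [
  {
    count: 2          # two execution instances of this model on GPU 0
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Optionally let Triton form batches across requests instead of waiting per source.
dynamic_batching {
  preferred_batch_size: [ 4 ]
  max_queue_delay_microseconds: 100
}
```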

hi @dsingalNV, thanks for your detailed explanation. Here are some extra questions:

1- With the current deepstream-app, how can I control whether the two instances run in parallel or in sequence?
2- How can the Deepstream-Triton server be separated from the client in the case of running the inference on a data center server?

Thanks for the blog. This is really impressive.

I have followed the steps and installed DeepStream on my local system as well as on a Jetson Nano. I am able to run DeepStream on my local system, but when I try to run it on the Nano, I get the error below. Could you help me resolve this?
deepstream-app: error while loading shared libraries: libnvinfer.so.7: cannot open shared object file: No such file or directory

Hi @dsingalNV, what scripts do you recommend to convert the model to TF-TRT INT8, and also to Native TRT INT8?

This is amazing. All the NVIDIA products are well designed and provide good performance.

Hi @virsg, sorry for the late reply. There are three methods I know of to convert your model to TF-TRT or TRT. Some support INT8 and some don’t.

  1. Use the Triton Inference Server’s built-in model optimizer for TF models: this enables TF-TRT optimization of the network before inference (although it adds latency to the initial launch) and automatic mixed precision for FP16.
  2. Use TF-TRT to generate a SavedModel or frozen graph (Accelerating Inference In TF-TRT User Guide :: NVIDIA Deep Learning Frameworks Documentation) and then quantize it (Accelerating Inference In TF-TRT User Guide :: NVIDIA Deep Learning Frameworks Documentation); a rough sketch follows this list.
  3. Use TRT to generate a standalone engine: Accelerating Inference In TF-TRT User Guide :: NVIDIA Deep Learning Frameworks Documentation
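
As a rough illustration of option 2, the TF 1.x TF-TRT flow for INT8 looks something like the sketch below. The paths, tensor names, batch size, and calibration feed are placeholders I made up for illustration; the TF-TRT user guide linked above is the authoritative reference:

```python
# Hypothetical sketch of TF-TRT INT8 conversion with TF 1.15 (the DeepStream 5.x era stack).
# Paths, tensor names, batch size, and the calibration data below are placeholders.
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir="faster_rcnn_inception_v2_saved_model",  # assumed export dir
    max_batch_size=4,
    precision_mode="INT8",
    maximum_cached_engines=1,
    use_calibration=True)

converter.convert()

def feed_dict_fn():
    # Replace with real preprocessed frames shaped like the model's input tensor.
    batch = np.random.randint(0, 255, size=(1, 600, 1024, 3), dtype=np.uint8)
    return {"image_tensor:0": batch}

# Run a representative set of inputs through the converted graph to collect INT8 calibration data.
converter.calibrate(
    fetch_names=["detection_boxes:0", "detection_scores:0",
                 "detection_classes:0", "num_detections:0"],
    num_runs=10,
    feed_dict_fn=feed_dict_fn)

converter.save("faster_rcnn_inception_v2_trt_int8")
```

The converted SavedModel can then go into the Triton model repository like any other TF model.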

@sk.ahmed401 that looks like an error with the DeepStream installation. Was it resolved or are you still looking for help?


@dsingalNV is there a way to have an architecture like this:

  1. a Triton Inference Server with models preloaded
  2. one or multiple instances of DeepStream reading streams, sending them for inferencing on the server, and then getting the results back?

Thank you, William

Yes, you can have the Triton server hosted elsewhere and communicate with it through gRPC if you use the gRPC option for nvdsinferserver:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinferserver.html
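
For anyone who lands here later, the gRPC mode essentially means pointing the nvinferserver config at the remote Triton endpoint instead of a local model repository. The fragment below is only a sketch: the model name and address are placeholders, and the exact field layout differs between DeepStream releases (older 5.x configs use a trt_is block with a local model_repo), so follow the plugin documentation linked above:

```
# Illustrative nvinferserver config fragment (protobuf text) for gRPC mode.
# Model name, address, and batch size are placeholders; check the plugin docs for your DS release.
infer_config {
  unique_id: 1
  max_batch_size: 4
  backend {
    triton {
      model_name: "faster_rcnn_inception_v2"
      version: -1
      grpc {
        url: "10.0.0.5:8001"   # remote Triton gRPC endpoint
      }
    }
  }
}
```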