Deploying Models from TensorFlow Model Zoo Using NVIDIA DeepStream and NVIDIA Triton Inference Server

Originally published at: Deploying Models from TensorFlow Model Zoo Using NVIDIA DeepStream and NVIDIA Triton Inference Server | NVIDIA Technical Blog

If you’re building a unique AI/DL application, you are constantly looking to train and deploy AI models from various frameworks like TensorFlow, PyTorch, TensorRT, and others quickly and effectively. Whether it’s deployment using the cloud, datacenters, or the edge, NVIDIA Triton Inference Server enables developers to deploy trained models from any major framework such as TensorFlow,…

Hi, this is Dhruv. Hope the blog was instructional. Triton Inference Server is something I find myself using very often to deploy models for simple tests as well as production. Being framework agnostic, it’s also really useful for testing off-the-shelf models for latency/performance and accuracy to make sure they’ll meet my needs. With the integration of Triton with DeepStream, these abilities are now available on NVIDIA dGPU and NVIDIA Jetson with streaming video and edge-to-cloud features. While this blog focuses on deepstream-app as a turnkey solution for IVA, the nvinferserver GStreamer plugin can be used for most models. Furthermore, TF-TRT allows for easy performance optimization with minimal time spent creating a TensorRT plan, so you can prototype and see what kind of low-hanging fruit can be used to improve performance. Good luck with your IVA projects!

Hi @dsingalNV, I see in your blog at Deploying Models from TensorFlow Model Zoo Using NVIDIA DeepStream and NVIDIA Triton Inference Server | NVIDIA Technical Blog that the sample was conducted using FasterRCNN-InceptionV2, but the results are shown for FasterRCNN-InceptionV3. Is that correct? If so, what is the expected performance for FasterRCNN-InceptionV2? Currently, for FP32 with no optimization (4 streams and BS=4), I am getting around 4 fps on a T4.

@virsg the results are for FasterRCNN-InceptionV2; thanks for bringing that issue to my attention. Regarding your performance: with 4 streams and BS=4, if you’re getting 4 fps per stream before any optimization, then that is in line with what we expect (our single-stream example obtained 12.8 fps total). On the other hand, if you’re observing 4 fps in total, i.e. 1 fps for each of your 4 streams, then some things to investigate would be:

  1. Make sure you’re using the same inference resolution.
  2. Make sure your config file for Triton sets the resolution properly (see the rough config.pbtxt sketch after this list).
  3. Check whether reducing the batch size increases fps, in case the model is having to wait for 4 frames to be collected before inference is conducted.
  4. Profile your system during the run to see if you’re CPU/IO bound. If not, then use the NVIDIA profiler to check GPU activity during the time spent.
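
To illustrate points 2 and 3, here is roughly what the relevant part of a Triton config.pbtxt looks like for a TF Object Detection API Faster R-CNN model. The tensor names, dims, and batch size below are just the usual OD API defaults and are assumptions on my part, so check them against your exported graph and the sample configs shipped with DeepStream:

```
# Hypothetical config.pbtxt sketch for a TF Object Detection API Faster R-CNN model.
# Tensor names, dims, and batch size are the usual OD API defaults -- verify against your export.
name: "faster_rcnn_inception_v2"
platform: "tensorflow_graphdef"
max_batch_size: 4            # point 3: try lowering this if frames queue up waiting for a full batch
input [
  {
    name: "image_tensor"
    data_type: TYPE_UINT8
    format: FORMAT_NHWC
    dims: [ 600, 1024, 3 ]   # point 2: keep this consistent with the inference resolution in DeepStream
  }
]
output [
  { name: "detection_boxes"   data_type: TYPE_FP32 dims: [ 100, 4 ] },
  { name: "detection_scores"  data_type: TYPE_FP32 dims: [ 100 ] },
  { name: "detection_classes" data_type: TYPE_FP32 dims: [ 100 ] },
  { name: "num_detections"    data_type: TYPE_FP32 dims: [ 1 ] }
]
```

If the model is waiting to collect a full batch, dropping max_batch_size here (and the corresponding batch size on the DeepStream side) is a quick way to confirm point 3.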

Thanks @dsingalNV, I am using the Docker image deepstream:5.0.1-20.09-triton. It seems that my results are OK then? See below:

With num-sources=1 I am getting **PERF: 17.31 (16.85)
With num-sources=4 I am getting **PERF: 4.33 (4.32) 4.33 (4.35) 4.33 (4.32) 4.33 (4.32)

Also some extra questions here:

1- What is the best technique to maximize GPU utilization with DeepStream and Triton? I have increased the instance count to 2, but average GPU utilization was ~60% at 5 fps per stream. I was expecting the GPU utilization and the throughput to double, since there were 2 instances of the model loaded onto the T4.

2- How can I launch Deepstream-Triton server and client separately?

Yes, the performance results you’re seeing are okay.

  1. deepstream-app is just a simple application to portray the use of DeepStream. In order to get the maximum performance, you would have to look at engineering your own video input/inference/post-processing pipeline.
    You would also be better served by using NVIDIA Nsight Systems to see what the GPU is actually doing with the multiple instances of the model: is the inference on the two instances happening in parallel, or is only one instance active at a time? (A rough sketch of how instances are declared on the Triton side follows this list.) In order to get more performance, you can look into translating the model into a TRT-only engine, which gets around quite a lot of overhead from TF-TRT. Currently, only two subgraphs of the model have TRT engines generated for them.
  2. To my knowledge there’s no way to do that in DeepStream, since the DeepStream gst-nvstreammux is passing the frames to gst-nvinferserver, thus acting as the client. In case you’re trying to do inference on a cloud-hosted instance of DeepStream, you could look into the [source] field that supports streamed input for inference.
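
For reference, on the Triton side the two model instances are declared in the model’s config.pbtxt via instance_group. The sketch below is only illustrative: the counts and dynamic_batching values are assumptions, and whether dynamic batching actually helps depends on how nvinferserver submits requests in your pipeline:

```
# Illustrative only -- goes in the model's config.pbtxt; counts and values are assumptions.
instance_group [
  {
    count: 2          # two execution instances of this model on GPU 0
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Optionally let Triton form batches across requests instead of waiting per source.
dynamic_batching {
  preferred_batch_size: [ 4 ]
  max_queue_delay_microseconds: 100
}
```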

hi @dsingalNV, thanks for your detailed explanation. Here are some extra questions:

1- With the current deepstream-app, how can I control whether the two instances run in parallel or in sequence?
2- How can the Deepstream-Triton server be separated from the client in the case of running the inference on a data center server?

Thanks for the blog. This is really impressive.

I have followed the steps and installed DeepStream on my local system as well as on a Jetson Nano. I am able to run DeepStream on my local system, but when I try to run it on the Nano, I get the error below. Could you help me resolve this?
deepstream-app: error while loading shared libraries: libnvinfer.so.7: cannot open shared object file: No such file or directory

Hi @dsingalNV, what scripts do you recommend to convert the model to TF-TRT INT8, and also to Native TRT INT8?

This is amazing. All the NVIDIA products are well designed and provide good performance.

Hi @virsg, sorry for the late reply. There are three methods I know of to convert your model to TF-TRT or TRT. Some support INT8 and some don’t.

  1. Use the Triton Inference Server’s built-in model optimizer for TF models: this enables TF-TRT optimization of the network before inference (although it adds latency to the initial launch) and automatic mixed precision for FP16.
  2. Use TF-TRT to generate a SavedModel or frozen graph (Accelerating Inference In TF-TRT User Guide :: NVIDIA Deep Learning Frameworks Documentation) and then quantize it (Accelerating Inference In TF-TRT User Guide :: NVIDIA Deep Learning Frameworks Documentation); a rough sketch follows this list.
  3. Use TRT to generate a standalone engine: Accelerating Inference In TF-TRT User Guide :: NVIDIA Deep Learning Frameworks Documentation
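
As a rough illustration of option 2, the TF 1.x TF-TRT flow for INT8 looks something like the sketch below. The paths, tensor names, batch size, and calibration feed are placeholders I made up for illustration; the TF-TRT user guide linked above is the authoritative reference:

```python
# Hypothetical sketch of TF-TRT INT8 conversion with TF 1.15 (the DeepStream 5.x era stack).
# Paths, tensor names, batch size, and the calibration data below are placeholders.
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir="faster_rcnn_inception_v2_saved_model",  # assumed export dir
    max_batch_size=4,
    precision_mode="INT8",
    maximum_cached_engines=1,
    use_calibration=True)

converter.convert()

def feed_dict_fn():
    # Replace with real preprocessed frames shaped like the model's input tensor.
    batch = np.random.randint(0, 255, size=(1, 600, 1024, 3), dtype=np.uint8)
    return {"image_tensor:0": batch}

# Run a representative set of inputs through the converted graph to collect INT8 calibration data.
converter.calibrate(
    fetch_names=["detection_boxes:0", "detection_scores:0",
                 "detection_classes:0", "num_detections:0"],
    num_runs=10,
    feed_dict_fn=feed_dict_fn)

converter.save("faster_rcnn_inception_v2_trt_int8")
```

The converted SavedModel can then go into the Triton model repository like any other TF model.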

@sk.ahmed401 that looks like an error with the DeepStream installation. Was it resolved or are you still looking for help?


@dsingalNV is there a way to have an architecture like this:

  1. a Triton Inference Server with models preloaded
  2. one or multiple instances of DeepStream reading streams, sending them for inferencing on the server, and then getting the results back?

Thank you, William

Yes, you can have the Triton server hosted elsewhere and communicate with it through gRPC if you use the gRPC option for nvdsinferserver:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinferserver.html
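
For anyone who lands here later, the gRPC mode essentially means pointing the nvinferserver config at the remote Triton endpoint instead of a local model repository. The fragment below is only a sketch: the model name and address are placeholders, and the exact field layout differs between DeepStream releases (older 5.x configs use a trt_is block with a local model_repo), so follow the plugin documentation linked above:

```
# Illustrative nvinferserver config fragment (protobuf text) for gRPC mode.
# Model name, address, and batch size are placeholders; check the plugin docs for your DS release.
infer_config {
  unique_id: 1
  max_batch_size: 4
  backend {
    triton {
      model_name: "faster_rcnn_inception_v2"
      version: -1
      grpc {
        url: "10.0.0.5:8001"   # remote Triton gRPC endpoint
      }
    }
  }
}
```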