Originally published at: https://developer.nvidia.com/blog/end-to-end-ai-for-nvidia-based-pcs-cuda-and-tensorrt-execution-providers-in-onnx-runtime/
This post is the fourth in a series about optimizing end-to-end AI. The last post described the higher-level idea behind ONNX and ONNX Runtime. As explained in the previous post in the End-to-End AI for NVIDIA-Based PCs series, there are multiple execution providers (EPs) in ONNX Runtime that enable the use of hardware-specific features or optimizations…
Thanks for the great blog post. Assuming a previously generated TRT engine, will ONNX with a TensorRT EP achieve the same runtime performance as running the engine directly through the TensorRT APIs? In other words, is there any performance penalty to using TensorRT through ONNX Runtime?
If your engine is not split up by ONNX Runtime, the performance should be the same. Essentially, if an ONNX file cannot be compiled into a single engine, ONNX Runtime will slice up the network and fall back to the CUDA execution provider for the unsupported ops.
There are a few things to watch out for:
- TensorRT in ONNX Runtime is not async by default, meaning you will waste valuable CPU time:
- https://github.com/microsoft/onnxruntime/pull/14088 shows the difference
- ProViz-AI-Samples/NVIDIAInference.cpp at master · NVIDIA/ProViz-AI-Samples · GitHub enables async execution as shown in the pull request above
- How do you provide data to TensorRT? You want to ensure that PCIe traffic and execution are overlapped by using CUDA streams and CUDA events. This is, in my opinion, a little more natural with pure TRT, but it is certainly possible with ONNX Runtime and demonstrated here: ProViz-AI-Samples/cuda_sample.cpp at master · NVIDIA/ProViz-AI-Samples · GitHub (see the sketch after this list)
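For illustration, here is a minimal sketch of that pattern (not the sample code itself), assuming a hypothetical model.onnx with a single float input named "input" and a single output named "output": the TensorRT EP is handed a user compute stream, the input and output are bound to pre-allocated device buffers through IoBinding, and the upload is issued with cudaMemcpyAsync on the same stream so that PCIe traffic and execution can overlap.

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "trt-async");
    Ort::SessionOptions opts;

    // Hand our own CUDA stream to the TensorRT EP so work is enqueued on it.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    OrtTensorRTProviderOptions trt{};
    trt.device_id = 0;
    trt.has_user_compute_stream = 1;
    trt.user_compute_stream = stream;
    opts.AppendExecutionProvider_TensorRT(trt);

    Ort::Session session(env, "model.onnx", opts);  // placeholder model path

    // Pre-allocated device buffers; names, shapes and sizes are placeholders.
    std::vector<int64_t> in_shape{1, 3, 224, 224}, out_shape{1, 1000};
    size_t in_count = 1 * 3 * 224 * 224, out_count = 1000;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, in_count * sizeof(float));
    cudaMalloc(&d_out, out_count * sizeof(float));

    // Pinned host memory so cudaMemcpyAsync can actually overlap with compute.
    float* h_in = nullptr;
    cudaMallocHost(&h_in, in_count * sizeof(float));

    Ort::MemoryInfo mem_info("Cuda", OrtDeviceAllocator, 0, OrtMemTypeDefault);
    Ort::Value in_tensor = Ort::Value::CreateTensor<float>(
        mem_info, d_in, in_count, in_shape.data(), in_shape.size());
    Ort::Value out_tensor = Ort::Value::CreateTensor<float>(
        mem_info, d_out, out_count, out_shape.data(), out_shape.size());

    Ort::IoBinding binding(session);
    binding.BindInput("input", in_tensor);
    binding.BindOutput("output", out_tensor);

    // Upload and inference are ordered by the same stream.
    cudaMemcpyAsync(d_in, h_in, in_count * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    session.Run(Ort::RunOptions{nullptr}, binding);

    cudaStreamSynchronize(stream);  // wait only when the result is actually needed
    return 0;
}
```

Note that whether Run() returns before the GPU work has finished depends on the EP's synchronization behavior (which is exactly what the pull request above is about); the sketch only shows the stream and binding setup.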
Thanks for the very important blog.
The provided information is very useful and clear.
In addition to @mirzadeh's question and @maximilianm's answer, I am interested to know if there is a way to get a kind of report from ONNX Runtime which specifies the following:
- Number of subgraphs
- Which execution provider is used by each subgraph/layer in case I set both the TRT/CUDA EPs and there is an unsupported operator, for example NonZero (for an old TRT version)
My question is based on the TensorRT native APIs, where with the right usage I can get this kind of information.
I know how the TensorRT graph is divided into subgraphs when its ONNX parser encounters an unsupported operator, and in case all operators are supported and my platform is Jetson, I can know which layer is mapped to DLA and which one fell back to the GPU.
I am looking for this kind of report from ONNX Runtime…
Thanks,
Hi @orong13, thanks!
ONNX Runtime has a few ways of telling where graphs are separated.
- trt_dump_subgraphs will dump the subgraphs as ONNX files to disk, which makes it nice and easy to debug (see the sketch after this list for how to enable it)
- ONNX Runtime’s verbose logging should tell you in detail what is happening and on which EP each node runs
- Then there is also ONNX Runtime’s own profiler. But I believe you will have to build from source to use that. It basically produces a JSON file which tells you, for each op that is run, the corresponding ONNX nodes and the EP.
- Last but certainly not least is Nsight Systems. If you run that, you will see our own library highlighting for TensorRT, and by that you can tell that everything around it is the CUDA EP. If compiled from source, you even get NVTX ranges from ONNX Runtime. I’ll attach a screenshot.
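As a rough sketch (model path is a placeholder, error handling omitted), this is how those options can be enabled through the legacy provider-options structs and the session options:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cstdio>

int main() {
    // Verbose logging: node placement per EP is reported in detail.
    Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "ep-debug");

    Ort::SessionOptions opts;
    opts.EnableProfiling("ort_profile");  // JSON trace, viewable in chrome://tracing

    // TensorRT EP with subgraph dumping; unsupported nodes fall back to the CUDA EP.
    OrtTensorRTProviderOptions trt{};
    trt.device_id = 0;
    trt.trt_dump_subgraphs = 1;  // writes each TRT subgraph as .onnx to the working directory
    opts.AppendExecutionProvider_TensorRT(trt);

    OrtCUDAProviderOptions cuda{};
    cuda.device_id = 0;
    opts.AppendExecutionProvider_CUDA(cuda);

    Ort::Session session(env, "model.onnx", opts);  // placeholder model path

    // ... run the session as usual ...

    // The profiler writes its output file when profiling ends.
    Ort::AllocatorWithDefaultOptions alloc;
    auto profile_file = session.EndProfilingAllocated(alloc);
    printf("profile written to %s\n", profile_file.get());
    return 0;
}
```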
Cheers, keep further questions coming!
Thank you very much @maximilianm for your quick response.
- Your suggestion about trt_dump_subgraphs is very useful.
  I set it to true and now I have the ability to analyze and process each subgraph independently.
- Regarding ONNX Runtime’s verbose logging:
  I will learn about this from this link: Logging & Tracing | onnxruntime
- Regarding NVIDIA Nsight Systems:
  I succeeded in using it and, as demonstrated by your screenshot, I successfully generated the same report for my model and analyzed it.
  But I couldn’t find low-level information about the specific layer which is the root cause of the problem - NonZero.
Attached are my tested ONNX files:
- Original: superpoint_lightglue_Opset16_IR8_1500SimpInfo.onnx
  superpoint_lightglue_Opset16_IR8_1500SimpInfo.zip (40.2 MB)
- Subgraphs:
  TensorrtExecutionProvider_TRTKernel_graph_main_graph_4886252126279222944_0_0.onnx
  TensorrtExecutionProvider_TRTKernel_graph_main_graph_4886252126279222944_1_1.onnx
  TensorrtExecutionProvider_TRT_Subgraph.onnx
  Sub_graphs.zip (40.2 MB)
If you analyze the subgraphs you will find that the NonZero layer was removed, which is exactly what I expected.
Can you please explain the purpose of the file TensorrtExecutionProvider_TRT_Subgraph.onnx?
It seems exactly the same as the file TensorrtExecutionProvider_TRTKernel_graph_main_graph_4886252126279222944_1_1.onnx, but its size is much smaller…?
Finally, do you have an idea, a reference, or an example of how to replace the NonZero operator in order to be able to use an old TensorRT 8.2.x version? I know that this TensorRT version’s plugin interface doesn’t support a dynamic operator whose output shape depends on the input content…
If I don’t find a TRT solution, I will implement it externally using CUDA and integrate it with both subgraphs in order to complete the original model logic.
Thank you!
But I couldn’t find low-level information about the specific layer which is the root cause of the problem - NonZero.
To acquire this layer correlation information you will have to compile ONNX Runtime with NVTX support. Alternatively, you can tell from the CUDA kernel name which layer is executed. For that you will have to zoom in a lot to see each kernel’s name.
Do you have an idea, a reference, or an example of how to replace the NonZero operator in order to be able to use an old TensorRT 8.2.x version?
No, this is an issue with the operator itself; as you already mentioned, it is a dynamic-output-size operator. Maybe you can write a custom implementation of it that provides the NonZero values but pads them with 0s to a fixed length? Your approach of implementing it externally sounds like the easiest, to be honest.
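To make the padding idea concrete, here is a rough, untested CUDA sketch (names and sizes are illustrative; note that ONNX NonZero actually returns one row of indices per input dimension, so this flat-index version is a simplification). It compacts the indices of non-zero elements into a fixed-size, zero-initialized buffer so the output shape no longer depends on the input content:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Fixed-size "NonZero": write flat indices of non-zero inputs into a
// pre-zeroed buffer of length max_out; out_count receives the real count.
// Index order is not deterministic because slots are claimed with atomicAdd.
__global__ void padded_nonzero(const float* input, int n,
                               int* out_indices, int max_out, int* out_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && input[i] != 0.0f) {
        int slot = atomicAdd(out_count, 1);  // claim the next free output slot
        if (slot < max_out)
            out_indices[slot] = i;
    }
}

int main()
{
    const int n = 8, max_out = 4;
    const float h_in[n] = {0, 1, 0, 2, 0, 0, 3, 0};

    float* d_in;  int *d_idx, *d_count;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_idx, max_out * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_idx, 0, max_out * sizeof(int));  // padding value: 0
    cudaMemset(d_count, 0, sizeof(int));

    padded_nonzero<<<(n + 255) / 256, 256>>>(d_in, n, d_idx, max_out, d_count);

    int h_idx[max_out], h_count;
    cudaMemcpy(h_idx, d_idx, max_out * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("count=%d indices=%d %d %d %d\n",
           h_count, h_idx[0], h_idx[1], h_idx[2], h_idx[3]);

    cudaFree(d_in); cudaFree(d_idx); cudaFree(d_count);
    return 0;
}
```

The downstream parts of the graph would then have to use the real count (or tolerate the padded zeros), and if you need the indices in order you would sort them or use a prefix-sum compaction instead of atomics. The benefit is that the shape becomes static (max_out), which is exactly what the old plugin interface requires.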

