Originally published at: https://developer.nvidia.com/blog/end-to-end-ai-for-nvidia-based-pcs-cuda-and-tensorrt-execution-providers-in-onnx-runtime/
This post is the fourth in a series about optimizing end-to-end AI. The last post described the higher-level idea behind ONNX and ONNX Runtime. As explained in the previous post in the End-to-End AI for NVIDIA-Based PCs series, there are multiple execution providers (EPs) in ONNX Runtime that enable the use of hardware-specific features or optimizations…
Thanks for the great blog post. Assuming a previously generated TRT engine, will ONNX with a TensorRT EP achieve the same runtime performance as running the engine directly through the TensorRT APIs? In other words, is there any performance penalty to using TensorRT through ONNX Runtime?
If your engine is not split up by ONNX Runtime, the performance should be the same. Essentially, if an ONNX file cannot be compiled into a single engine, ONNX Runtime will slice up the network and fall back to the CUDA execution provider for the unsupported ops.
There are a few things to watch out for:
- TensorRT in ONNX Runtime is not async by default, meaning you will waste valuable CPU time:
- https://github.com/microsoft/onnxruntime/pull/14088 shows the difference
- ProViz-AI-Samples/NVIDIAInference.cpp at master · NVIDIA/ProViz-AI-Samples · GitHub enables async execution as shown in the pull request above
- How do you provide data to TensorRT? You want to ensure that PCIe traffic and execution are overlapped by using CUDA streams and CUDA events. This is, in my opinion, a little more natural with pure TRT, but it is certainly possible with ONNX Runtime and demonstrated here: ProViz-AI-Samples/cuda_sample.cpp at master · NVIDIA/ProViz-AI-Samples · GitHub (see the sketch after this list)
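For illustration, here is a minimal sketch of that pattern (not the sample code itself), assuming a hypothetical model.onnx with a single float input named "input" and a single output named "output": the TensorRT EP is handed a user compute stream, the input and output are bound to pre-allocated device buffers through IoBinding, and the upload is issued with cudaMemcpyAsync on the same stream so that PCIe traffic and execution can overlap.

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "trt-async");
    Ort::SessionOptions opts;

    // Hand our own CUDA stream to the TensorRT EP so work is enqueued on it.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    OrtTensorRTProviderOptions trt{};
    trt.device_id = 0;
    trt.has_user_compute_stream = 1;
    trt.user_compute_stream = stream;
    opts.AppendExecutionProvider_TensorRT(trt);

    Ort::Session session(env, "model.onnx", opts);  // placeholder model path

    // Pre-allocated device buffers; names, shapes and sizes are placeholders.
    std::vector<int64_t> in_shape{1, 3, 224, 224}, out_shape{1, 1000};
    size_t in_count = 1 * 3 * 224 * 224, out_count = 1000;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, in_count * sizeof(float));
    cudaMalloc(&d_out, out_count * sizeof(float));

    // Pinned host memory so cudaMemcpyAsync can actually overlap with compute.
    float* h_in = nullptr;
    cudaMallocHost(&h_in, in_count * sizeof(float));

    Ort::MemoryInfo mem_info("Cuda", OrtDeviceAllocator, 0, OrtMemTypeDefault);
    Ort::Value in_tensor = Ort::Value::CreateTensor<float>(
        mem_info, d_in, in_count, in_shape.data(), in_shape.size());
    Ort::Value out_tensor = Ort::Value::CreateTensor<float>(
        mem_info, d_out, out_count, out_shape.data(), out_shape.size());

    Ort::IoBinding binding(session);
    binding.BindInput("input", in_tensor);
    binding.BindOutput("output", out_tensor);

    // Upload and inference are ordered by the same stream.
    cudaMemcpyAsync(d_in, h_in, in_count * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    session.Run(Ort::RunOptions{nullptr}, binding);

    cudaStreamSynchronize(stream);  // wait only when the result is actually needed
    return 0;
}
```

Note that whether Run() returns before the GPU work has finished depends on the EP's synchronization behavior (which is exactly what the pull request above is about); the sketch only shows the stream and binding setup.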
Thanks for the very important blog.
The provided information is very useful and clear.
In addition to @mirzadeh's question and @maximilianm's answer, I am interested to know if there is a way to get a kind of report from ONNX Runtime which specifies the following:
- Number of subgraphs
- Which execution provider is used by each subgraph/layer in case I set both the TRT/CUDA EPs and there is an unsupported operator, for example NonZero (for an old TRT version)
My question is based on the TensorRT native APIs, where with the right usage I can get this kind of information.
I know how the TensorRT graph is divided into subgraphs when its ONNX parser encounters an unsupported operator, and in case all operators are supported and my platform is Jetson, I can know which layer is mapped to DLA and which one fell back to the GPU.
I am looking for this kind of report from ONNX Runtime…
Thanks,
Hi @orong13, thanks!
ONNX Runtime has a few ways of telling where graphs are separated.
- trt_dump_subgraphs will dump the subgraphs as ONNX files to disk, which makes it nice and easy to debug (see the sketch after this list for how to enable it)
- ONNX Runtime’s verbose logging should tell you in detail what is happening and on which EP each node runs
- Then there is also ONNX Runtime’s own profiler. But I believe you will have to build from source to use that. It basically produces a JSON file which tells you, for each op that is run, the corresponding ONNX nodes and the EP.
- Last but certainly not least is Nsight Systems. If you run that, you will see our own library highlighting for TensorRT, and by that you can tell that everything around it is the CUDA EP. If compiled from source, you even get NVTX ranges from ONNX Runtime. I’ll attach a screenshot.
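As a rough sketch (model path is a placeholder, error handling omitted), this is how those options can be enabled through the legacy provider-options structs and the session options:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cstdio>

int main() {
    // Verbose logging: node placement per EP is reported in detail.
    Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "ep-debug");

    Ort::SessionOptions opts;
    opts.EnableProfiling("ort_profile");  // JSON trace, viewable in chrome://tracing

    // TensorRT EP with subgraph dumping; unsupported nodes fall back to the CUDA EP.
    OrtTensorRTProviderOptions trt{};
    trt.device_id = 0;
    trt.trt_dump_subgraphs = 1;  // writes each TRT subgraph as .onnx to the working directory
    opts.AppendExecutionProvider_TensorRT(trt);

    OrtCUDAProviderOptions cuda{};
    cuda.device_id = 0;
    opts.AppendExecutionProvider_CUDA(cuda);

    Ort::Session session(env, "model.onnx", opts);  // placeholder model path

    // ... run the session as usual ...

    // The profiler writes its output file when profiling ends.
    Ort::AllocatorWithDefaultOptions alloc;
    auto profile_file = session.EndProfilingAllocated(alloc);
    printf("profile written to %s\n", profile_file.get());
    return 0;
}
```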
Cheers, keep further questions coming!
Thank you very much @maximilianm for your quick response.
- Your suggestion about trt_dump_subgraphs is very useful.
  I set it to true and now I have the ability to analyze and process each subgraph independently.
- Regarding ONNX Runtime’s verbose logging:
  I will learn about this from this link: Logging & Tracing | onnxruntime
- Regarding NVIDIA Nsight Systems:
  I succeeded in using it and, as demonstrated by your screenshot, I successfully generated the same report for my model and analyzed it.
  But I couldn’t find low-level information about the specific layer which is the root cause of the problem - NonZero.
Attached are my tested ONNX files:
- Original: superpoint_lightglue_Opset16_IR8_1500SimpInfo.onnx
  superpoint_lightglue_Opset16_IR8_1500SimpInfo.zip (40.2 MB)
- Subgraphs:
  TensorrtExecutionProvider_TRTKernel_graph_main_graph_4886252126279222944_0_0.onnx
  TensorrtExecutionProvider_TRTKernel_graph_main_graph_4886252126279222944_1_1.onnx
  TensorrtExecutionProvider_TRT_Subgraph.onnx
  Sub_graphs.zip (40.2 MB)
If you analyze the subgraphs you will find that the NonZero layer was removed, which is exactly what I expected.
Can you please explain the purpose of the file TensorrtExecutionProvider_TRT_Subgraph.onnx?
It seems exactly the same as the file TensorrtExecutionProvider_TRTKernel_graph_main_graph_4886252126279222944_1_1.onnx, but its size is much smaller…?
Finally, do you have an idea, a reference, or an example of how to replace the NonZero operator in order to be able to use an old TensorRT 8.2.x version? I know that this TensorRT version’s plugin interface doesn’t support a dynamic operator whose output shape depends on the input content…
If I don’t find a TRT solution, I will implement it externally using CUDA and integrate it with both subgraphs in order to complete the original model logic.
Thank you!
But I couldn’t find low-level information about the specific layer which is the root cause of the problem - NonZero.
To acquire this layer correlation information you will have to compile ONNX Runtime with NVTX support. Alternatively, you can tell from the CUDA kernel name which layer is executed. For that you will have to zoom in a lot to see each kernel’s name.
Do you have an idea, a reference, or an example of how to replace the NonZero operator in order to be able to use an old TensorRT 8.2.x version?
No, this is an issue with the operator itself; as you already mentioned, it is a dynamic-output-size operator. Maybe you can write a custom implementation of it that provides the NonZero values but pads them with 0s to a fixed length? Your approach of implementing it externally sounds like the easiest, to be honest.
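To make the padding idea concrete, here is a rough, untested CUDA sketch (names and sizes are illustrative; note that ONNX NonZero actually returns one row of indices per input dimension, so this flat-index version is a simplification). It compacts the indices of non-zero elements into a fixed-size, zero-initialized buffer so the output shape no longer depends on the input content:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Fixed-size "NonZero": write flat indices of non-zero inputs into a
// pre-zeroed buffer of length max_out; out_count receives the real count.
// Index order is not deterministic because slots are claimed with atomicAdd.
__global__ void padded_nonzero(const float* input, int n,
                               int* out_indices, int max_out, int* out_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && input[i] != 0.0f) {
        int slot = atomicAdd(out_count, 1);  // claim the next free output slot
        if (slot < max_out)
            out_indices[slot] = i;
    }
}

int main()
{
    const int n = 8, max_out = 4;
    const float h_in[n] = {0, 1, 0, 2, 0, 0, 3, 0};

    float* d_in;  int *d_idx, *d_count;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_idx, max_out * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_idx, 0, max_out * sizeof(int));  // padding value: 0
    cudaMemset(d_count, 0, sizeof(int));

    padded_nonzero<<<(n + 255) / 256, 256>>>(d_in, n, d_idx, max_out, d_count);

    int h_idx[max_out], h_count;
    cudaMemcpy(h_idx, d_idx, max_out * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("count=%d indices=%d %d %d %d\n",
           h_count, h_idx[0], h_idx[1], h_idx[2], h_idx[3]);

    cudaFree(d_in); cudaFree(d_idx); cudaFree(d_count);
    return 0;
}
```

The downstream parts of the graph would then have to use the real count (or tolerate the padded zeros), and if you need the indices in order you would sort them or use a prefix-sum compaction instead of atomics. The benefit is that the shape becomes static (max_out), which is exactly what the old plugin interface requires.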

