Clarity needed on differences between acceleration frameworks/runtimes for AGX Xavier

My question/confusion is fairly general and spans a number of the forum categories. I’m posting it here because of my end goal: to convert an existing, custom PyTorch model into one that can be run in evaluation/inference mode at the highest possible speed on the AGX Xavier (with awareness that using lower FP or INT precision may accelerate it further, with some accuracy degradation).

I’ve been looking at ONNX and TensorRT, but getting confused about the differences and overlap between them. And then CUDA can confuse matters further. Here is a list of tools/packages/etc. that I’m struggling to fully distinguish:

onnx
onnxruntime
tensorrt
tensorrt engine
cuda

Much of my confusion comes when reading the ONNX documentation. For example, it seems that one can use the Python ONNX tooling to convert a PyTorch model into an ONNX-format model, and then use onnxruntime to run inference with that model, taking advantage of whatever hardware platform is available to you. However, it seems onnxruntime is not independent of TensorRT (in recent versions), but rather can utilize it. As stated in this NVIDIA announcement, “You can also use ONNX Runtime with the TensorRT libraries by building the Python package from the source.” And yes, at that linked Microsoft site you find instructions on how to build an “ONNX Runtime Python wheel,” but you may also “optionally build with experimental TensorRT support.”
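For concreteness, here is a minimal sketch of the workflow I have in mind, with a small placeholder model and input shape (in practice the export is done with PyTorch’s built-in torch.onnx.export rather than the onnx package itself):

```python
import torch
import onnxruntime as ort

# Placeholder model and input shape -- substitute your own custom network.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Export the PyTorch model to an ONNX-format file.
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=11,
)

# Run inference on the exported model with onnxruntime.
sess = ort.InferenceSession("model.onnx")
outputs = sess.run(None, {"input": dummy_input.numpy()})
print(outputs[0].shape)
```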

Okay, so far that’s not terribly confusing. But this NVIDIA documentation states: “TensorRT can be used as a library within a user application. It includes parsers for importing existing models from Caffe, ONNX, or TensorFlow, and C++ and Python APIs for building models programmatically.”

Hmmm. So should I use onnxruntime to run an ONNX model on the Xavier, or should I use TensorRT to run it?

And finally, the official onnxruntime download site, here, has binaries and source available for download. In the menu of items you can choose for a particular binary or build/install instructions, there is a “Hardware Acceleration” set of options from which you can choose one. Among them are (a) CUDA and (b) TensorRT. Doesn’t TensorRT utilize CUDA in some fashion? When/why would I choose the CUDA acceleration over the TensorRT acceleration, or vice versa?

Can anyone add some clarity to all of this?

My particular model was trained in PyTorch, with CUDA support. Should I be converting it into onnx and then trying to run/infer on the Xavier using onnxruntime? Using TensorRT? Something else? Some combination?

Thanks for any comments/advice!

Hi,

1.

ONNX is an intermediate DNN model format.
A TensorRT engine is the serialized file produced by TensorRT’s optimizer for a specific model and GPU.

onnxruntime and TensorRT are the corresponding inference libraries for each of these formats.
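To make the distinction concrete, here is a rough sketch of building a serialized engine from an ONNX file with TensorRT’s Python API (TensorRT 7.x-era calls; the filenames are placeholders, and newer releases deprecate build_engine/max_workspace_size in favor of other calls):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(EXPLICIT_BATCH)
parser = trt.OnnxParser(network, TRT_LOGGER)

# Parse the ONNX model into a TensorRT network definition.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 28        # 256 MiB of scratch space
# config.set_flag(trt.BuilderFlag.FP16)    # optional reduced precision

# Build the optimized engine and write out its serialized form.
engine = builder.build_engine(network, config)
with open("model.engine", "wb") as f:
    f.write(engine.serialize())
```

The trtexec command-line tool that ships with TensorRT can do the same ONNX-to-engine conversion without writing any code.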

2.

The main difference is that onnxruntime is implemented by a third party (Microsoft) while TensorRT is developed by NVIDIA.

So you may see a choice between CUDA and TensorRT when installing onnxruntime.
In that context, CUDA usually refers to the library owner’s own GPU implementation of the operators.

3.

Another point of confusion is onnxruntime-TRT vs. standalone TensorRT.
With onnxruntime-TRT, the overall interface is still onnxruntime,
but it uses TensorRT for the supported layers and handles the data transfer between the two libraries.

The same idea can also be found in TF-TRT and TRTorch.
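For the onnxruntime-TRT case, a minimal sketch of how that typically looks (the provider list and the input name "input" are assumptions; which providers are available depends on how the onnxruntime wheel was built, and older versions may require sess.set_providers() instead of the constructor argument):

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in priority order: layers TensorRT supports run through
# the TensorRT provider, the rest fall back to the CUDA provider, then CPU.
sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(sess.get_providers())  # shows which providers are actually active

dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {"input": dummy})
```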

4.

Usually, there is some performance gain when using TensorRT,
since it chooses fast algorithms based on the GPU architecture.
This matters even more on a Jetson device.

However, TensorRT may not support all the operations used in the ONNX model.
And the serialized engine is not portable across devices or TensorRT versions.
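On the portability point, the serialized .engine file is tied to the GPU and TensorRT version it was built with, and is loaded back like this (a minimal sketch; the filename is a placeholder):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize an engine built on this same GPU and TensorRT version.
# An engine built elsewhere (e.g. on a desktop GPU) will generally not load.
runtime = trt.Runtime(TRT_LOGGER)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()  # used to run inference
```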

Thanks.

Thanks, @AastaLLL! That certainly clears things up a bit, though some murkiness remains.

Can you comment a little further on CUDA (your #2)? When/why would someone install an onnxruntime build that utilizes the CUDA hardware accelerator rather than the TensorRT accelerator? Is it because CUDA supports operations that TensorRT does not (perhaps related to your #4)?

It sounds like the approach I should take is:

  1. If all the operations of my PyTorch model are supported by TensorRT, convert the model to ONNX format and then use the TensorRT inference engine to run it. In this case, do I have to convert the ONNX model to a serialized TensorRT engine file first? Can I convert from a PyTorch model directly to a serialized TensorRT engine file, skipping the conversion to an ONNX model?
  2. If some operations of the PyTorch model are not supported by TensorRT, convert the model to ONNX format and then use the onnxruntime inference engine to run it. In this case, should I build/use the onnxruntime engine that has (a) CUDA hardware acceleration or (b) TensorRT hardware acceleration? How can I determine which one is appropriate for my model, if using the AGX Xavier?

Am I missing anything in this proposed approach?

Many Thanks,
Matt

Hi,

YES. Since TensorRT is targeted at inference, not all operations are available in it.

1. The ONNX format is officially supported in TensorRT.
But it is also possible to convert a .pth model to a TensorRT engine directly.

2. You can try TRTorch.
TRTorch embeds TensorRT into the PyTorch library.
It deploys the supported layers with TensorRT and falls back to the PyTorch implementation for non-supported layers:
https://github.com/NVIDIA/TRTorch
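A rough sketch of what that looks like (the compile-spec keys below follow an older TRTorch release and may differ in the version you install; the model and shape are placeholders):

```python
import torch
import trtorch

# Placeholder network -- substitute your own model.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval().cuda()

# TRTorch compiles TorchScript modules, so trace (or script) the model first.
scripted = torch.jit.trace(model, torch.randn(1, 3, 224, 224).cuda())

# Compile with TensorRT; spec keys vary between TRTorch releases.
compile_settings = {
    "input_shapes": [(1, 3, 224, 224)],
    "op_precision": torch.half,   # optional FP16 on Xavier
}
trt_module = trtorch.compile(scripted, compile_settings)

out = trt_module(torch.randn(1, 3, 224, 224).half().cuda())
```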

Thanks.