My question/confusion is fairly general and spans a number of the forum categories. I’m posting it here because of my end goal: to convert an existing, custom PyTorch model into one that can be run in evaluate/inference mode at the highest possible speed on the AGX Xavier (with the awareness that using lower FP or INT precision may accelerate it further, at some cost in accuracy).
I’ve been looking at ONNX and TensorRT, but I’m getting confused about the differences and overlap between them. And then CUDA confuses matters further. Here is a list of tools/packages/etc. that I’m struggling to fully distinguish:
onnx
onnxruntime
tensorrt
tensorrt engine
cuda
Much of my confusion comes from reading the ONNX documentation. For example, it seems that one can use torch.onnx (together with the python onnx package) to export a pytorch model to the onnx format, and then use onnxruntime to run inference with that model, taking advantage of whatever hardware platform is available to you. However, it seems onnxruntime is not independent of tensorrt (in recent versions), but rather utilizes it. As stated in this NVIDIA announcement, “You can also use ONNX Runtime with the TensorRT libraries by building the Python package from the source.” And yes, at that linked Microsoft site you find instructions on how to build an “ONNX Runtime Python wheel,” where you may also “optionally build with experimental TensorRT support.”
Okay, so far that’s not terribly confusing. But this NVIDIA documentation states: “TensorRT can be used as a library within a user application. It includes parsers for importing existing models from Caffe, ONNX, or TensorFlow, and C++ and Python APIs for building models programmatically.”
Hmmm. So should I use onnxruntime to run an onnx model on the Xavier, or should I use TensorRT to run it?
And finally, the official onnxruntime download site, here, has binaries and source available for download. In the menu of items you can choose for a particular binary or for build/install instructions, there is a “Hardware Acceleration” set of options from which you choose one. Among them are (a) CUDA and (b) TensorRT. Doesn’t TensorRT utilize CUDA in some fashion? When/why would I choose CUDA acceleration over TensorRT acceleration, or vice versa?
Can anyone add some clarity to all of this?
My particular model was trained in PyTorch, with CUDA support. Should I convert it to onnx and then run inference on the Xavier using onnxruntime? Using TensorRT? Something else? Some combination?
Thanks for any comments/advice!