When I test inference time with my model, I get better latency when I use the flag --useCudaGraph with trtexec. I see this is an inference flag, so I guess it doesn’t affect the saved engine. So how do I get this same speedup when running the saved engine through the TensorRT Python API?
Please refer to the following document, which may help you:
Thanks for the link, but the only thing I can find on using CUDA graphs is in the C++ API: Developer Guide :: NVIDIA Deep Learning TensorRT Documentation. Does that mean the Python API for using CUDA graphs is not yet implemented?
TensorRT does not own the cudaGraph feature.
cudaGraph is offered by CUDA. Please refer to the following document for more information. We are not sure whether a Python API exists for it.
Thanks for pointing me in the right direction! For other people reading this topic: pycuda does not yet support graph execution but people are working on it: cuda - CUDA Python 12.0.0 documentation
There is also cuda-python, NVIDIA's own CUDA Python wrapper, which does seem to have graph support: cuda - CUDA Python 12.0.0 documentation. So I'll investigate that next.
Edit: sadly, cuda-python needs CUDA 11.0 or newer, which is not available in JetPack 4.6.3, the newest JetPack supported on the Jetson TX2 and Jetson Nano. So cuda-python cannot be used for graph execution there.
It seems CuPy supports CUDA graphs too.
Is CuPy a direct alternative to pycuda? Would it be possible to implement a TensorRT execution in Python using CuPy? It seems from the documentation it’s a numpy alternative, which might not be exactly the same.
Also I see in my previous message I provided the wrong link. This is the thread in which someone is working on adding graph support to pycuda: [WIP] Add support for CUDA Graphs. by gfokkema · Pull Request #343 · inducer/pycuda · GitHub
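On the CuPy question above, here is a rough, untested sketch of how stream capture might be combined with a TensorRT execution context. It assumes CuPy's stream capture interface (Stream.begin_capture, Stream.end_capture, Graph.launch) and the TensorRT 8.x Python binding API (get_binding_shape, execute_async_v2), which was changed in later TensorRT releases. The names capture_graph, enqueue and demo, and the path model.engine, are made up for illustration. I have not checked whether a CuPy build with this API is available for Jetson, so treat it as a starting point rather than a working solution.

```python
def capture_graph(stream, enqueue):
    """Record one inference launch on `stream` and return the captured graph.

    stream:  object with the cupy.cuda.Stream capture interface (assumed)
    enqueue: hypothetical callable that launches inference on a raw stream handle
    """
    # Warm-up launch outside capture, so any lazy setup happens before recording.
    enqueue(stream.ptr)
    stream.synchronize()

    stream.begin_capture()
    enqueue(stream.ptr)          # recorded into the graph, not executed yet
    return stream.end_capture()  # with CuPy this is a launchable cupy.cuda.Graph


def demo(engine_path="model.engine", runs=100):
    """Assumed wiring for TensorRT + CuPy; needs a GPU and is not verified here."""
    import cupy as cp
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # One device buffer per binding (assumes static shapes, TensorRT 8.x API).
    buffers = [
        cp.empty(tuple(engine.get_binding_shape(i)),
                 dtype=trt.nptype(engine.get_binding_dtype(i)))
        for i in range(engine.num_bindings)
    ]
    bindings = [int(b.data.ptr) for b in buffers]

    stream = cp.cuda.Stream(non_blocking=True)
    graph = capture_graph(
        stream, lambda handle: context.execute_async_v2(bindings, handle))

    # Replaying the graph repeats the recorded launch on the same buffers.
    for _ in range(runs):
        graph.launch(stream)
    stream.synchronize()
    return buffers
```

The warm-up call before capture mirrors the C++ guide, which runs the context once before recording. Because a graph replays fixed device pointers, new inputs would have to be copied into the existing buffers instead of allocating new ones.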