I’m running TensorRT inference on a Jetson Nano 2GB board. Only the device memory allocated by the TensorRT allocator is released by calling .destroy(); the cuDNN and cuBLAS memory is not released. I found some similar topics (CUDA memory release, GPU memory may leak during deserializing the engine on TensorRT 6), but it looks like that memory cannot be released before the application is terminated.
Since we need to run the application continuously and 2GB of memory is very limited, how can we release the cuDNN and cuBLAS memory without terminating the application?
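For reference, my teardown looks roughly like the sketch below (TensorRT C++ API; the helper and variable names are just illustrative):

```cpp
#include <NvInfer.h>

// Roughly what I do after inference: destroy the TensorRT objects.
// This returns the device memory TensorRT itself allocated, but the
// memory consumed by loading cuDNN/cuBLAS stays with the process.
void releaseTrt(nvinfer1::IExecutionContext* context,
                nvinfer1::ICudaEngine* engine,
                nvinfer1::IRuntime* runtime)
{
    if (context) context->destroy();
    if (engine)  engine->destroy();
    if (runtime) runtime->destroy();
}
```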
The memory is used for loading the cuDNN/cuBLAS libraries.
If you are using TensorRT 8.0 (JetPack 4.6), an alternative is to run inference on the model without using cuDNN.
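For example, you can build the engine with trtexec and remove the cuDNN/cuBLAS tactic sources, along the lines of the following (the model file names are placeholders):

```
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx \
                              --saveEngine=model.trt \
                              --tacticSources=-CUDNN,-CUBLAS
```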
A small question: I converted the .onnx model to a .trt model using trtexec with --tacticSources=-CUDNN,-CUBLAS, and the inference time does not appear to have increased while the results are still correct. Are cuDNN and cuBLAS necessary for inference? Is there any difference?
TensorRT’s dependencies (cuDNN and cuBLAS) can occupy large amounts of device memory. TensorRT allows you to control whether these libraries are used for inference via the TacticSources (C++, Python) attribute in the builder configuration. Note that some operator implementations require these libraries, so the network may not compile when they are excluded.
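As a rough sketch (assuming the TensorRT 8.0 C++ API; the helper function name is ours), clearing the corresponding bits in the builder configuration before building has the same effect as the --tacticSources flag used above:

```cpp
#include <cstdint>
#include <NvInfer.h>

// Clear the cuDNN/cuBLAS tactic-source bits in the builder config so the
// built engine does not rely on those libraries for its tactics.
void disableCudnnCublasTactics(nvinfer1::IBuilderConfig& config)
{
    uint32_t sources = config.getTacticSources();
    sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUDNN));
    sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS));
    sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));  // cuBLASLt variant
    config.setTacticSources(static_cast<nvinfer1::TacticSources>(sources));
}
```

If the build then fails, the network contains an operator that only has cuDNN/cuBLAS implementations, and that tactic source has to stay enabled.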