Hi all,
I have this simple classification neural network (input size 1x64x128x3, so batch size 1).
Its float16 weights are 1.6 MB. I’m trying to run it on a memory-constrained Jetson device alongside a number of more demanding neural networks. When I run it with ONNXRuntime using the CPUExecutionProvider, system memory usage increases by 12 MB between before and after running the network. With the CUDAExecutionProvider, the increase is 990 MB. Not great! So I tried TensorRT, excluding the tactic sources I do not need:
/usr/src/tensorrt/bin/trtexec --onnx=classifier.onnx --saveEngine=classifier.trt --fp16 --tacticSources=-CUDNN,-CUBLAS,-CUBLAS_LT
When I use this TensorRT engine, the memory increase is 385 MB. Much better than the 990 MB with ONNXRuntime, but still far more than the 12 MB when I run it on CPU. What gives? I’ve seen people on these forums argue it’s because of the CUDA libraries that need to be loaded, but even when I first load another neural network in the same Python process and thread, the memory increase remains the same. And I don’t see any reason for the CUDA libraries to be loaded twice.
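For reference, here is roughly how I got the ONNXRuntime numbers. This is a simplified sketch, not my exact code: psutil and the dummy float16 input are just stand-ins for however you measure system memory and feed the model.

import numpy as np
import onnxruntime as ort
import psutil

def used_mb():
    # System-wide used memory in MB (Jetson has unified CPU/GPU memory).
    return psutil.virtual_memory().used / 1e6

before = used_mb()

# Swap in "CUDAExecutionProvider" to reproduce the 990 MB case.
session = ort.InferenceSession("classifier.onnx",
                               providers=["CPUExecutionProvider"])

# Dummy input; dtype may need to be float32 depending on how the model was exported.
dummy = np.zeros((1, 64, 128, 3), dtype=np.float16)
session.run(None, {session.get_inputs()[0].name: dummy})

after = used_mb()
print(f"Memory increase: {after - before:.0f} MB")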
For TensorRT inference in Python, I use pycuda, based on the Nvidia samples provided in e.g. /usr/src/tensorrt/samples/python/efficientnet. I’m running Jetpack 4.6.3 for this test. Most of the 385 MB memory increase (336 MB of it) occurs during these lines:
with trt.Runtime(self.__class__.TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(data)
whereas the TensorRT engine file being loaded is just 3.8 MB.
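For completeness, the rest of the loading code follows the common.py helper from the samples fairly closely. A simplified sketch (the real code lives in a class, and the names here are illustrative), assuming the TensorRT 8.2 binding API that ships with Jetpack 4.6.3:

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates the CUDA context on import
import pycuda.driver as cuda

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    # This deserialization step is where most of the 336 MB shows up.
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def allocate_buffers(engine):
    # Page-locked host buffers plus device buffers for each binding,
    # mirroring allocate_buffers() in the samples' common.py.
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append((host_mem, device_mem))
        else:
            outputs.append((host_mem, device_mem))
    return inputs, outputs, bindings, stream

engine = load_engine("classifier.trt")
context = engine.create_execution_context()  # allocates further device memory
inputs, outputs, bindings, stream = allocate_buffers(engine)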
Can anyone elucidate where the difference in memory usage between CPU and GPU inference comes from, and whether there is anything to be done about it?
Btw, here is the verbose trtexec log for building the engine if it helps:
trtexec_verbose.log (525.8 KB)