GPU vs CPU deep learning memory usage

Hi all,

I have a simple classification neural network (input size 1x64x128x3, so batch size 1).

Its float16 weights are 1.6 MB. I’m trying to run it on a memory-constrained Jetson device alongside a bunch of more demanding neural networks. When I run it with ONNXRuntime using the CPUExecutionProvider, system memory usage measured before and after loading the network increases by 12 MB. When I use the CUDAExecutionProvider, the increase is 990 MB. Not great! So I tried TensorRT, excluding the tactic sources I do not need:

/usr/src/tensorrt/bin/trtexec --onnx=classifier.onnx --saveEngine=classifier.trt --fp16 --tacticSources=-CUDNN,-CUBLAS,-CUBLAS_LT

When I use this TensorRT engine, the increase in memory usage is 385 MB. Much better than the 990 MB using ONNXRuntime, but still a lot more than the 12 MB used when I run it on CPU. What gives? I’ve seen people on these forums argue it’s because of the CUDA libraries that need to be loaded, but when I first load another neural network in the same Python process + thread, the memory increase remains the same. And I don’t see any reason for the CUDA libraries to be loaded twice.
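For reference, the ONNXRuntime numbers above come from simply swapping the execution provider; a minimal sketch of the measurement setup (input dtype is an assumption here, the model path matches the trtexec command above):

import numpy as np
import onnxruntime as ort

# Same model for both measurements; only the execution provider differs.
session = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
# session = ort.InferenceSession("classifier.onnx", providers=["CUDAExecutionProvider"])

dummy_input = np.zeros((1, 64, 128, 3), dtype=np.float16)
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: dummy_input})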

For TensorRT inference in Python, I use pycuda, based on the NVIDIA samples provided in e.g. /usr/src/tensorrt/samples/python/efficientnet. I’m running JetPack 4.6.3 for this test. Most of the 385 MB memory increase (336 MB) occurs during this call:

with trt.Runtime(self.__class__.TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(data)

whereas the TensorRT engine file being loaded is just 3.8 MB.
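For completeness, the surrounding load code is roughly this (a simplified sketch of what my class does, not the exact code):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Read the serialized engine (only 3.8 MB on disk) ...
with open("classifier.trt", "rb") as f:
    data = f.read()

# ... yet deserializing it accounts for ~336 MB of the memory increase.
with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(data)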

Can anyone elucidate where the difference in memory usage comes from between CPU and GPU inference, and if there is anything to be done about it?

Btw, here is the verbose trtexec log for building the engine if it helps:
trtexec_verbose.log (525.8 KB)

Dear @frederiki3k63
How did you check the memory consumption? Is it using tegrastats?

I parsed /proc/meminfo and compared MemAvailable before and after loading the network. But I just measured it using tegrastats and I get the same results.
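Concretely, the check is just a diff of MemAvailable around the load; a minimal sketch of what I run:

def mem_available_kb():
    # Return the MemAvailable field of /proc/meminfo, in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

before = mem_available_kb()
# ... load the network here ...
after = mem_available_kb()
print(f"Memory increase: {(before - after) / 1024:.1f} MB")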

If you need me to create and share a standalone end-to-end test script demonstrating this behaviour, including an ONNX file, I can do so and share it with you privately. Let me know. But to me, it seems the behaviour I’m describing always occurs, independently of which small neural network is run.

Just to add: deserializing the same model twice also doubles the (V)RAM usage (770 MB vs 385 MB). According to this NVIDIA employee post, this should not happen, which leads me to believe there is something wrong here.

Aaaah, I finally figured out what I was doing wrong thanks to this post. I was creating a CUDA context per model, whereas I should have one global CUDA context for all models in the same process. I have now made the call to

cfx = cuda.Device(0).make_context()

global, so the overhead is incurred only once instead of per model. I can now run my model with just 12 MB of added memory, very similar to the CPU memory usage.
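In case it helps, here is a sketch of what the fixed setup looks like (names and the second model path are placeholders, not my actual code):

import pycuda.driver as cuda
import tensorrt as trt

cuda.init()
# One global CUDA context shared by every model in this process,
# instead of one context per model (which duplicated the ~385 MB overhead).
GLOBAL_CFX = cuda.Device(0).make_context()

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    # All engines are deserialized under the same shared context.
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engines = [load_engine("classifier.trt"), load_engine("other_model.trt")]

# Pop the context once when the process is done with the GPU.
GLOBAL_CFX.pop()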

I hope this thread helps someone else who is just as obtuse as me. :D

Edit: or, even better, use cfx = cuda.Device(0).retain_primary_context().
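With the primary-context variant, the only real difference is that the context must be pushed explicitly before use; a sketch:

import pycuda.driver as cuda

cuda.init()
# Reuse the device's primary context (the one other CUDA libraries in this
# process also use) instead of creating an additional context.
cfx = cuda.Device(0).retain_primary_context()
cfx.push()

# ... deserialize engines and run inference as before ...

cfx.pop()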
