According to the TensorRT documentation, you can expect high host memory (RAM) usage during the build phase and lower host memory usage at runtime. This is what I'd expect, since inference should mostly use device (GPU) memory. This is also corroborated here, which implies a fixed amount of host memory usage at runtime; the variable amount is in the build stage.
However, this is not what I've experienced with the TensorRT library. Our system uses a relatively large amount (~4 GB) of host memory during runtime, and I've been able to replicate this with trtexec. In fact, with trtexec we can observe a slight increase in host memory usage after the build stage has finished.
Here's the trtexec example; any ONNX model will do:
/usr/src/tensorrt/bin/trtexec --useSpinWait --fp16 --timingCacheFile=/home/user/.cache --onnx=/home/user/model.onnx --duration=20
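For reference, this is roughly how I sampled host memory while trtexec was running. It's a minimal Linux-only sketch that reads a process's resident set size (RSS) from /proc; the function name and approach are my own for illustration, not part of trtexec or TensorRT:

```python
import os

def rss_kib(pid: int) -> int:
    """Return the current resident set size of `pid` in KiB (Linux only)."""
    # /proc/<pid>/status exposes VmRSS, the physical host memory in use.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # reported in kB
    raise RuntimeError("VmRSS not found")

if __name__ == "__main__":
    # Sample this process as a demo; in practice I polled trtexec's pid
    # (e.g. from `pidof trtexec`) once a second before and after the build.
    print(rss_kib(os.getpid()))
```

Polling this once a second shows the RSS staying at ~4 GB (and creeping up slightly) after the build stage completes, rather than dropping to a small fixed amount.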
I've also tried deserializing the engine from a serialized .engine plan file instead, to rule out the ONNX build stage as the cause, but it still uses the same amount of host memory during runtime.
This is running on a dGPU, not Jetson, so host and device memory are separate. We are running a large/complex model with a large workspace, in case that is a factor in runtime host memory usage.
- Is it expected that large amounts of host memory can be consumed while running a model (after it has been built)?
- What factors increase host memory consumption at runtime? Are they the same as in the build stage (i.e. model complexity, workspace size, etc.)?
- Why is it necessary to use a large amount of host memory after the model has been built?
TensorRT Version: 8.0.0
GPU Type: NVIDIA GeForce RTX 3070 Laptop GPU
Operating System + Version: Ubuntu 20.04