I am measuring the RAM used during inference using a classification model (Inception_v1), TensorRT and PyCUDA.
I used page-locked memory to allocate the tensors.
To measure the memory used, I read /proc/meminfo. I take the reading after the predictions have been copied back from the GPU to the CPU (as you can see in the image).
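A minimal sketch of how such a reading from /proc/meminfo can be taken (the used_ram_mib helper and the MemTotal - MemAvailable definition of used RAM are illustrative assumptions, not necessarily the exact measurement script):

```python
def used_ram_mib():
    """Parse /proc/meminfo and report used RAM as MemTotal - MemAvailable, in MiB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    return (info["MemTotal"] - info["MemAvailable"]) / 1024.0

# called right after the output has been copied back from the GPU
print("Used RAM: %.1f MiB" % used_ram_mib())
```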
The TX2 is using around 3.5GB of RAM. I made a plot of it.
Why is that amount of memory being used?
Could it be less? Maybe if I had used Unified Memory?
Is it related to the amount of RAM used while optimizing the .onnx file with “trtexec”? I used this command: /usr/src/tensorrt/bin/trtexec --onnx=inception_v1_2016_08_28_frozen.onnx --saveEngine=inception_v1_2016_08_28_fp16.trt --workspace=4096 --fp16
Please note that it takes around 600 MiB of memory just to load the TensorRT library.
The input/output buffers and the inference weights also require memory.
It’s possible to limit the workspace memory that TensorRT uses to store intermediate data during inference.
This restricts TensorRT to selecting only the inference algorithms that fit within the given memory limit.
You can do this by adjusting the workspace value directly, as sketched below.
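For example (a sketch assuming the TensorRT 7/8 Python API; the build_engine_from_onnx name and the 256 MiB value are illustrative), the same limit that trtexec sets with --workspace can be applied through the builder config:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine_from_onnx(onnx_path, workspace_mib=256, fp16=True):
    """Build a TensorRT engine with an explicit workspace limit (in MiB)."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse %s" % onnx_path)

    config = builder.create_builder_config()
    # Workspace = scratch memory TensorRT may use for intermediate data;
    # smaller values exclude tactics that need more scratch space.
    config.max_workspace_size = workspace_mib << 20  # MiB -> bytes
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    return builder.build_engine(network, config)
```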
Can you explain this to me? If I had used Unified Memory, would there still be a memory requirement for the input/output buffers?
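For reference, a unified-memory variant with PyCUDA could look roughly like this (a sketch assuming pycuda.driver.managed_empty; the buffer size is illustrative). Such a buffer still occupies the TX2's shared RAM; it only removes the separate host copy and the explicit memcpy:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

# One managed buffer visible to both CPU and GPU (instead of h_input + d_input).
# It still consumes RAM on the TX2, since CPU and GPU share the same physical memory.
input_size = 1 * 3 * 224 * 224  # illustrative Inception_v1 input size
io_input = cuda.managed_empty(input_size, dtype=np.float32,
                              mem_flags=cuda.mem_attach_flags.GLOBAL)
io_input[:] = 0.0  # the CPU can write to it directly, no memcpy_htod needed
```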
Based on your sample, there is a memory copy step that copies h_input into d_input. d_input and d_output should be the buffers prepared for TensorRT inference, correct?
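As a sketch of the buffer setup and copy step being described (assuming the engine file produced by the trtexec command above and the execute_async_v2 API; buffer sizes and names are illustrative):

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine built with trtexec and create an execution context.
with open("inception_v1_2016_08_28_fp16.trt", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

input_size = 1 * 3 * 224 * 224   # illustrative Inception_v1 input size
output_size = 1 * 1001           # illustrative number of classes

# Pinned host buffers plus separate device buffers: all four consume memory.
h_input = cuda.pagelocked_empty(input_size, dtype=np.float32)
h_output = cuda.pagelocked_empty(output_size, dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
stream = cuda.Stream()

# Host -> device copy, inference, device -> host copy.
cuda.memcpy_htod_async(d_input, h_input, stream)
context.execute_async_v2(bindings=[int(d_input), int(d_output)],
                         stream_handle=stream.handle)
cuda.memcpy_dtoh_async(h_output, d_output, stream)
stream.synchronize()
```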