Getting "OutOfMemory Error in GpuMemory: 0" from small CNN and small data-set

Hello, my objective is to train a very simple CNN on MNIST using TensorFlow, convert it to TensorRT, and use it to perform inference on the MNIST test set using TensorRT, all on a Jetson Nano, but I am getting several errors and warnings, including “OutOfMemory Error in GpuMemory: 0”. To try to reduce the memory footprint, I also created a script that simply loads the TensorRT model (already converted and saved by the previous script) and uses it to perform inference on a small subset of the MNIST test set (100 images), but I am still getting the same out-of-memory error. The entire directory containing the TensorRT model is only 488 KB, and the 100 test images can’t be taking up very much memory, so I am confused about why GPU memory is running out. What could be the reason for this, and how can I solve it?

I am attaching the console output from the 2 Python scripts below as text files. These text files and the Python scripts which generated them can be found on this Gist:
z_console_output.txt (39.9 KB) zz2_trt_infer_console_output.txt (21.8 KB)

Another thing that seems suspicious is that some of the TensorFlow logging info messages are printed multiple times, e.g. “Successfully opened dynamic library libcudart”, “Successfully opened dynamic library libcublas”, and “ARM64 does not support NUMA - returning NUMA node zero”. What could be the reason for this (e.g. dynamic libraries being opened over and over again), and could it have something to do with why the GPU memory keeps running out?


I just checked your log, and it looks like everything works fine (no error message).
Please let me know if anything is missing.

Here is our end-to-end sample for MNIST for your reference:



Hi @AastaLLL , thank you for your reply (and sorry for not getting back to you sooner). You said there are no error messages, but I believe there are (even though inference itself is able to proceed in the end). In the first file I posted above, for example (z_console_output.txt), there are duplicated error messages on lines 271 and 272, ending with “OutOfMemory Error in GpuMemory: 0” and surrounded by other error messages, e.g. “Requested amount of GPU memory (1073741824 bytes) could not be allocated. There may not be enough free memory for allocation to succeed”. It also doesn’t seem right to me that the logging messages include >100 lines telling me which memory addresses are free or in use. There are also various warnings, e.g. “Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.12GiB”. So one of my questions is: how is it possible that >1 GB is being requested on the GPU, when all I’m trying to do is use a model (which is <1 MB) to perform inference on 100x28x28x1 floating point values (which can’t be taking up very much memory)?
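To put the numbers in that question side by side, here is a quick back-of-the-envelope check. It assumes the inputs are stored as 4-byte float32 values, which the post does not state explicitly; the requested size is the figure quoted from the error message.

```python
# Back-of-the-envelope comparison of the input batch size with the failed allocation.
BYTES_PER_FLOAT32 = 4  # assumption: inputs are stored as float32

input_shape = (100, 28, 28, 1)  # 100 MNIST images, as described in the post
input_values = 1
for dim in input_shape:
    input_values *= dim

input_bytes = input_values * BYTES_PER_FLOAT32
requested_bytes = 1073741824  # figure quoted from the error message

print(f"input batch: {input_bytes} bytes ({input_bytes / 1024:.2f} KiB)")
print(f"requested:   {requested_bytes} bytes ({requested_bytes / 2**30:.0f} GiB)")
```

Under that assumption the input batch is roughly 306 KiB, while the failed request is exactly 1 GiB. A round power-of-two figure like that suggests a fixed-size reservation (such as a workspace or memory pool) rather than anything sized to the data, although the logs alone do not confirm which component asked for it.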

Someone answered my similar question on Stack Overflow (it has the same title as this forum post) and offered a solution, which is to limit how much memory can be allocated on the GPU using the TensorFlow config as follows:
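A minimal sketch of that kind of limit, assuming the TensorFlow 2.x `tf.config.experimental` API. The exact snippet from the Stack Overflow answer may differ, and `limit_gpu_memory` is a hypothetical helper name used only for illustration.

```python
def limit_gpu_memory(limit_mb=2048):
    """Cap TensorFlow's allocation on the first GPU; return True if a cap was set.

    Assumption: TensorFlow 2.x. This must run before TensorFlow initializes
    the GPU (i.e. before any model is loaded or any op is executed).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return False  # TensorFlow not installed on this machine

    gpus = tf.config.experimental.list_physical_devices("GPU")
    if not gpus:
        return False  # no GPU visible to TensorFlow

    # Expose the first physical GPU as one logical device with a fixed memory cap (in MB).
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=limit_mb)],
    )
    return True


limit_applied = limit_gpu_memory(2048)
print("GPU memory limit applied:", limit_applied)
```

Calling it after the GPU has already been initialized raises a `RuntimeError`, so it would normally sit at the very top of the inference script.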


I followed their suggestion, set a limit of 2048 MB (2 GB) on the memory allocated to the GPU, and ran inference. I will include the console output below, as well as the training and inference scripts. As shown, this removed all of the error messages (as well as the 100+ lines of logs about memory addresses). However, I still have 2 main issues. The first (and major) issue is that inference is taking >30s for a small CNN on 100x28x28x1 floating point test images. This is far too long (for reference, using the same model in TensorFlow on my PC on the full 10000 test images takes 0.802s); I would have expected TensorRT on an Nvidia GPU to be much faster. So what could be the reason that inference is taking so long?

My other issue is that I’m still seeing logging messages about opening dynamic libraries (e.g. libcudart, libcublas, etc.) printed 5 or 6 times each at different points during a single inference session (see console output file below). Surely these dynamic libraries should only be opened once and then stay open? Could this have something to do with why inference is taking so long, and why so much memory was originally being requested on the GPU?

console_output.txt (10.8 KB) (2.8 KB) (3.4 KB)


1. It’s known that TensorFlow takes some time to create a session.
For better performance, it’s recommended to use pure TensorRT rather than TF-TRT.

In addition, when creating a TensorRT engine, TensorRT measures the performance of each candidate algorithm and picks the fastest one.
This may also take some time, but it is a one-time job.
Please remember to serialize the TensorRT engine so it can be reused next time.
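The serialize-and-reuse advice can be sketched as a small cache helper. This is only an illustration under assumptions: `load_or_build_plan`, `fake_build`, and the file name `mnist.plan` are hypothetical, and the build step is a stand-in. On a real device the builder would return the bytes from the TensorRT engine's `serialize()` call, and the returned bytes would then be passed to `trt.Runtime(...).deserialize_cuda_engine(...)`.

```python
import tempfile
from pathlib import Path


def load_or_build_plan(plan_path, build_plan):
    """Return serialized engine bytes, running the slow build only on a cache miss."""
    path = Path(plan_path)
    if path.exists():
        return path.read_bytes()   # fast path: reuse the saved engine
    plan = bytes(build_plan())     # slow path: build once ...
    path.write_bytes(plan)         # ... and save it for the next run
    return plan


# Demo with a stand-in builder that counts how often the slow step runs.
build_calls = 0

def fake_build():
    global build_calls
    build_calls += 1
    return b"serialized-engine-bytes"  # placeholder for real engine bytes

with tempfile.TemporaryDirectory() as tmp:
    plan_file = Path(tmp) / "mnist.plan"
    first = load_or_build_plan(plan_file, fake_build)
    second = load_or_build_plan(plan_file, fake_build)

print("builds performed:", build_calls)
```

A serialized engine is generally tied to the TensorRT version and GPU it was built on, so the cached file should be rebuilt after changing either.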

2. This depends on the implementation.
The library is loaded when a TensorFlow session is created.
If you create and close a session several times (e.g. for each frame), it will reload the library multiple times.
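One way to avoid repeated setup is to create the model (or session) once and reuse it for every batch. The sketch below illustrates the pattern only: `load_model` is a hypothetical stand-in for whatever expensive call the real script makes, and the sleep just simulates its cost.

```python
import functools
import time


def load_model():
    """Hypothetical stand-in for the expensive step (session or model creation)."""
    time.sleep(0.05)  # simulate slow start-up
    return lambda batch: [x * 2 for x in batch]


@functools.lru_cache(maxsize=1)
def get_model():
    # The first call pays the start-up cost; later calls return the cached model.
    return load_model()


def infer(batch):
    return get_model()(batch)


infer([0])  # warm-up call, so start-up time is not counted as inference time

start = time.perf_counter()
results = [infer([i]) for i in range(100)]
elapsed = time.perf_counter() - start
print(f"100 inferences after warm-up: {elapsed:.4f}s")
```

Timing only the calls after the warm-up also separates one-time start-up cost from per-batch inference time, which makes a figure like the >30s reported above easier to interpret.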