Training detectnet_v2:resnet10 on custom data fails with out of memory

I have successfully run the example application detectnet_v2:resnet18.
I got this error after I changed the data and model.

The GPU's video memory was not exhausted,
but free host memory (16 GB total) dropped from 14 GB to 0.

2020-12-09 08:44:08.817698: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-09 08:44:13.079675: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-12-09 08:44:13.079994: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2020-12-09 08:44:13.121916: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-12-09 08:44:13.122022: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 7730940928
/usr/local/bin/tlt-train: line 32: 2078 Killed tlt-train-g1 ${PYTHON_ARGS[*]}
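
The log lines say "on host" and "pinned host memory", so the two failed requests are for CPU-side memory rather than GPU memory. As a rough illustration of how large they are, here is a minimal sketch that converts the byte counts from the log (the helper name `to_gib` is made up for this example):

```python
# Sizes copied from the two "failed to alloc ... bytes on host" log lines.
FAILED_ALLOCS = [8589934592, 7730940928]


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


for size in FAILED_ALLOCS:
    # Print each request with two decimals so the sizes are easy to compare.
    print(f"{size} bytes = {to_gib(size):.2f} GiB")
```

The first request is exactly 8 GiB and the second is about 7.2 GiB, which is a large share of a 16 GB machine's RAM.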

This is the tlt-train result: res.txt (22.4 KB)
This is the training config: train.txt (4.0 KB)

Could you please share the output of nvidia-smi? Is any other process using the GPU?

BTW, please set width/height to multiples of 16.
https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/supported_model_architectures.html#object-detection
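
To pick dimensions that satisfy the multiple-of-16 rule, something like the following could be used. This is only a sketch: the function names and the 1920x1080 example size are assumptions for illustration, not values from the attached config.

```python
def round_up_to_multiple(value: int, base: int = 16) -> int:
    """Smallest multiple of `base` that is >= value."""
    return ((value + base - 1) // base) * base


def round_down_to_multiple(value: int, base: int = 16) -> int:
    """Largest multiple of `base` that is <= value."""
    return (value // base) * base


# Hypothetical input size; replace with your own image width/height.
width, height = 1920, 1080
print(round_up_to_multiple(width), round_up_to_multiple(height))
print(round_down_to_multiple(width), round_down_to_multiple(height))
```

For this example size the width is already a multiple of 16, while the height would become 1088 (rounded up) or 1072 (rounded down).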

I monitored the GPU with “watch -n 0.3 nvidia-smi”.
Only some display processes are running (~170 MB).
The DS app uses at most 3000 MB.

Total: 7981 MB
DS app plus display processes: at most 3204 MB
Free: ~4000 MB

thanks

Is the issue fixed? If not, could you restart the Docker container or reboot the machine and retry?

BTW, could you free up more GPU memory by killing the DS app and then run again?

Not solved.
Sorry, I typed it wrongly: it is not the DS app, it is the TLT app.
It seems that this error is not about GPU memory.
GPU memory usage peaks at about 3 GB.

I can run the example tlt app
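
Since GPU usage stays low while free host memory drops to 0, it may help to watch host RAM during training as well. Below is a minimal sketch that assumes a Linux host with `/proc/meminfo`; the function names `parse_meminfo`, `available_gib`, and `watch_host_memory` are made up for this example.

```python
import time


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: value_in_kB} dict."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        # Skip lines that do not carry a numeric value.
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(text: str) -> float:
    """Return the MemAvailable field converted from kB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / 2**20


def watch_host_memory(interval_s: float = 1.0, samples: int = 10) -> None:
    """Print available host memory `samples` times (Linux only)."""
    for _ in range(samples):
        with open("/proc/meminfo") as f:
            print(f"available host memory: {available_gib(f.read()):.2f} GiB")
        time.sleep(interval_s)
```

Running `watch_host_memory()` in a second terminal while tlt-train starts would show whether available host memory collapses right before the process is killed.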

I suggest checking the following to narrow it down.

  1. Check the hardware/software requirements
    Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation

  2. Try running another detection network, for example yolo_v3,
    to check whether the issue is common to all networks on your side.

  3. Reboot the machine and retry.

The images need to be resized. Set width/height to multiples of 16.
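
If the images are resized offline, the bounding boxes in the labels have to be scaled by the same factors. Here is a minimal sketch, assuming the labels are KITTI-format text lines with the box stored as left, top, right, bottom in fields 4 to 7; the function name and the sample label are hypothetical.

```python
def scale_kitti_label(line: str, scale_x: float, scale_y: float) -> str:
    """Scale the bounding box of one KITTI label line.

    Assumes fields 4..7 (0-based) hold left, top, right, bottom in pixels.
    All other fields are passed through unchanged.
    """
    fields = line.split()
    left, top, right, bottom = (float(v) for v in fields[4:8])
    fields[4:8] = [
        f"{left * scale_x:.2f}",
        f"{top * scale_y:.2f}",
        f"{right * scale_x:.2f}",
        f"{bottom * scale_y:.2f}",
    ]
    return " ".join(fields)


# Hypothetical label and scale factors (new_size / old_size per axis).
label = "car 0.00 0 0.00 100.00 200.00 300.00 400.00 0 0 0 0 0 0 0"
print(scale_kitti_label(label, scale_x=0.5, scale_y=0.5))
```

The scale factors are the new width divided by the old width and the new height divided by the old height, so they can differ per axis when the aspect ratio changes.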