VPI_STATUS_OUT_OF_MEMORY vpiDeviceCreate() Failed

• Hardware Platform (Jetson / GPU) T4 GPU
• DeepStream Version Deepstream-5.0
• NVIDIA GPU Driver Version (valid for GPU only) 460.32.03

We are using the Milestone VPS plugin with a DeepStream pipeline that includes different detection/classification models.
We are using Milestone 2020 R2 and VPS from MIPSDK 2020 R1. Milestone runs on Windows, while the VPS/DeepStream pipeline runs on Ubuntu 18.04.

The DeepStream pipeline is crashing at runtime with the following error:

When we run the pipeline, all the models instantiate and start running. Yet at some random point, the pipeline crashes and starts giving this error (attached).

Also note that running the command watch nvidia-smi normally shows complete GPU information, but when the error happens, all the running processes and their metrics (allocated memory, GPU utilization, etc.) disappear (drop to 0), while the GPU info itself keeps being displayed with no running processes.

Can you please help with this?

Please find attached the nvidia-bug-report after running the command:
sudo nvidia-bug-report.sh

We also collected some log data using nvidia-smi (sampling every 5 seconds), which shows that when the error occurs the memory utilization drops to 0.

nvidia-bug-report.log.gz (1.2 MB)

Do you mean this failure happens after the pipeline has run for some time?

From the log - “VPI_STATUS_OUT_OF_MEMORY” - I suspect this issue happened due to running out of memory, maybe caused by a GPU memory leak.
Could you use “nvidia-smi” to monitor the memory usage as shown below?

I think it’s because the application halted; with no reads/writes happening, the memory utilization dropped to 0.


Do you mean this failure happens after the pipeline has run for some time?

Yes, it sometimes runs for several hours before this error is encountered.

Could you use “nvidia-smi” to monitor the memory usage like below?

It is constant; we even monitored it once when the crash happened in front of us. It was at the normal level, then it suddenly dropped to 0. This is also confirmed by the attached image, since we are sampling the GPU logs including the memory usage (it can be seen that the fb column is constant over the whole period until the crash happened).
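For anyone reproducing this kind of monitoring, the drop to 0 can be spotted automatically from such a sampled log. A minimal sketch, assuming the log was collected with something like nvidia-smi --query-gpu=timestamp,memory.used,utilization.memory --format=csv -l 5 (the file contents, column layout, and find_drop helper below are illustrative assumptions, not from this thread):

```python
import csv
import io

# Hypothetical excerpt of a CSV log produced by
# `nvidia-smi --query-gpu=timestamp,memory.used,utilization.memory --format=csv -l 5`.
# The values are illustrative, not from the actual incident.
sample_log = """timestamp, memory.used [MiB], utilization.memory [%]
2021/03/01 10:00:00.000, 6230 MiB, 41 %
2021/03/01 10:00:05.000, 6232 MiB, 40 %
2021/03/01 10:00:10.000, 0 MiB, 0 %
"""

def find_drop(log_text):
    """Return the timestamp of the first sample where used GPU memory falls to 0."""
    reader = csv.reader(io.StringIO(log_text))
    next(reader)  # skip the header row
    for row in reader:
        timestamp = row[0].strip()
        mem_used = int(row[1].strip().split()[0])  # "6230 MiB" -> 6230
        if mem_used == 0:
            return timestamp
    return None

print(find_drop(sample_log))  # timestamp of the crash sample
```

Running this against the full log would pinpoint the exact sample at which the process died, which is useful for correlating with the DeepStream console output.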

When checking more logs, we noticed that the respective error has 2 forms of logs:
The first is the one attached above.

The second gives more details about the issue: it mentions “Failed to enqueue inference batch”. Are these two different scenarios/cases?

I don’t know whether this is more descriptive of the issue.

Did you run batch processing with the NvDCF tracker? What are the batch size and configuration?


Please note that upgrading DeepStream to 5.1 solved the issue; we have not faced any crash since. According to our tests, the issue was caused by the NvDCF tracker in version 5.0: when using the KLT tracker, no crash was encountered.
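For anyone hitting the same crash who cannot upgrade yet, the tracker is selected in the [tracker] group of the deepstream-app config. A hedged sketch of the switch (library paths assume the default DeepStream 5.0 install layout; the width/height values are placeholders - verify both against your own setup):

```ini
[tracker]
tracker-width=640
tracker-height=384
gpu-id=0
# NvDCF tracker (the one that triggered the crash in our tests):
#ll-lib-file=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_nvdcf.so
#ll-config-file=tracker_config.yml
#enable-batch-process=1
# KLT tracker (no crash observed with this one):
ll-lib-file=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_mot_klt.so
```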