CUDA illegal memory access error while running test application in DeepStream

Hi,

Description

Models used:
Detector: PeopleNet (ResNet34)
Classification: ResNet50 (classification-1), ResNet18 (classification-2), and ResNet18 (classification-3).

I have used the deepstream-imagedata-multistream test application and customized its pipeline according to my use case. The application works fine for about 2 hours, but after that it gives an illegal memory access error.
An error screenshot is attached below.

Environment

• Hardware Platform (GPU): NVIDIA GeForce RTX 2080 Ti
• DeepStream Version: 5.0
• NVIDIA GPU Driver Version : 450.102.04
• Issue Type: question

How can we avoid this error and run the test application continuously?
If more details are needed please let me know.

Hi @Pritam,
Could you run nvidia-smi to monitor the memory usage while your application is running, to check whether the CUDA memory consumption increases continuously?
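In case it helps, here is a minimal monitoring sketch (an assumption on my side, not part of DeepStream: it just shells out to nvidia-smi; the parsing helper also accepts captured text so it can be tried on a machine without a GPU):

```python
import subprocess

def gpu_memory_used_mib(smi_output=None):
    """Return per-GPU used memory in MiB.

    Parses the output of
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`.
    If smi_output is given, parse that text instead of invoking nvidia-smi
    (handy for testing without a GPU).
    """
    if smi_output is None:
        smi_output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True)
    return [int(line) for line in smi_output.split() if line]

if __name__ == "__main__":
    # Example with a captured line; on the live box, call it with no
    # argument inside a loop (e.g. every 30 s) and watch for steady growth.
    print(gpu_memory_used_mib("3584\n"))  # -> [3584]
```

A steadily increasing number over hours would point to a leak; a flat number would point elsewhere.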

Hi,

We have checked the memory in the meantime; it does not rise above 3.5 GB of the 11 GB GPU memory, so memory growth does not seem to be the issue. We have observed the same problem again with low memory usage. Please refer to the attached logs in case they provide more clues: surveillance_gateway.err.log (2.0 MB)

It mostly gives a corrupted double-linked list error, like the following:

nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (400) invalid resource handle.
corrupted double-linked list

nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (709) context is destroyed.
corrupted double-linked list

nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (400) invalid resource handle.
nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (400) invalid resource handle.
corrupted size vs. prev_size

nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (700) an illegal memory access was encountered.
corrupted double-linked list

nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (46) all CUDA-capable devices are busy or unavailable.
free(): corrupted unsorted chunks

GPUassert: an illegal memory access was encountered src/modules/cuDCF/cudaCropScaleInTexture2D.cu 1254
corrupted double-linked list

GPUassert: an illegal memory access was encountered src/modules/NvDCF/NvDCF.cpp 3461
corrupted double-linked list

From the log, the earliest failure is the one below. It seems the CUDA context is destroyed while a CUDA task is still running; could you check why and where the CUDA context is destroyed?

(python3:23215): GStreamer-CRITICAL **: 13:34:21.452: gst_buffer_get_sizes_range: assertion 'GST_IS_BUFFER (buffer)' failed
nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (709) context is destroyed.
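If it helps, one rough way to probe this from the Python side (a sketch of mine, not a DeepStream API: it calls cuCtxGetCurrent from the CUDA driver API via ctypes, assuming libcuda.so.1 is loadable) is to log, from a pad probe or timer callback, whether the calling thread still has a live current context:

```python
import ctypes

def cuda_context_alive():
    """Best-effort check: does this thread have a current CUDA context?

    Returns None if libcuda cannot be loaded (no NVIDIA driver present),
    otherwise True/False based on cuCtxGetCurrent from the CUDA driver API.
    """
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None
    ctx = ctypes.c_void_p()
    # CUDA_SUCCESS is 0; if no context is current (or it was destroyed from
    # under this thread), ctx stays NULL or the call returns an error code.
    status = libcuda.cuCtxGetCurrent(ctypes.byref(ctx))
    return status == 0 and ctx.value is not None
```

Calling this periodically and logging the timestamp of the first False result can help narrow down which teardown path destroys the context while kernels are still in flight.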

Hi,
I am also facing the same issue. I have customized the deepstream-test Python apps, running 24x7 with around 10-15 RTSP streams. We get this issue after 5-6 hours of running.

As you suggested checking the CUDA context state, could you please elaborate on how we can do that? Sorry if I am asking too basic a question; I am not an expert at this.

OK, could you share the kernel log captured with the steps below?

  1. Reproduce this issue

  2. Run:

$ sudo nvidia-bug-report.sh

  3. Share the nvidia-bug-report.log.gz with us

Thanks!