Deepstream-app crash with nvbufsurface: NvBufSurfaceSysToHWCopy error

Please provide complete information as applicable to your setup.

• Hardware Platform (RTX 2080 Ti)
• DeepStream Version 5.0
• TensorRT Version 7
• Driver Version: 450.36.06, CUDA Version: 11.0

While running deepstream-app with the following setup (a minimal config sketch follows the list):

  • 4 RTSP stream sources
  • 4 RTSP stream sinks (3 hardware encoded + 1 software encoded, with sink3 sync=1)
  • YOLOv3 in FP16 mode (from the objectDetector_Yolo sample)
  • Batch size of 4 for the PGIE, batch size of 16 for the secondary GIEs
  • Secondary inference with the Car Make, Car Type, Car Color and Face Detection models provided in the sample

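For context, a minimal deepstream-app configuration sketch matching this setup could look as follows; the group names and keys follow the standard deepstream-app config format, while the file names, URI, ports and exact values are illustrative assumptions rather than the actual config used:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5

# [source0] is repeated as [source1]..[source3] for the 4 RTSP inputs
[source0]
enable=1
# type=4 selects an RTSP source
type=4
uri=rtsp://<camera-or-simulated-stream>
num-sources=1
gpu-id=0

[streammux]
gpu-id=0
# batch size equal to the number of sources
batch-size=4
width=1280
height=720
live-source=1

[primary-gie]
enable=1
gpu-id=0
batch-size=4
gie-unique-id=1
config-file=config_infer_primary_yoloV3.txt

# repeated for the Car Make / Car Type / Car Color / Face secondary models
[secondary-gie0]
enable=1
gpu-id=0
batch-size=16
gie-unique-id=2
operate-on-gie-id=1
config-file=config_infer_secondary_carcolor.txt

# [sink0] is repeated per output stream; the HW/SW encoder split is discussed further below
[sink0]
enable=1
# type=4 publishes an RTSP output stream
type=4
codec=1
sync=0
rtsp-port=8554
udp-port=5400
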
The application crashes with different sets of errors:

**PERF: 30.04 (30.02) 30.04 (30.02) 30.04 (30.02) 30.04 (30.02)
**PERF: 29.94 (30.02) 29.94 (30.02) 29.94 (30.02) 29.94 (30.02)
nvbufsurface: NvBufSurfaceSysToHWCopy: failed in mem copy
nvbufsurface: NvBufSurfaceCopy: failed to copy
ERROR: …/nvdsinfer/nvdsinfer_func_utils.cpp:31 [TRT]: …/rtSafe/cuda/cudaSoftMaxRunner.cpp (111) - Cudnn Error in execute: 8 (CUDNN_STATUS_EXECUTION_FAILED)
nvbufsurface: NvBufSurfaceSysToHWCopy: failed in mem copy
ERROR: nvdsinfer_context_impl.cpp:1420 postprocessing cuda waiting event failed , cuda err_no:719, err_str:cudaErrorLaunchFailure
nvbufsurface: NvBufSurfaceCopy: failed to copy
ERROR in BufSurfacecopy
Cuda failure: status=719 in CreateTextureObj at line 2513
nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (719) unspecified launch failure.
1:16:21.839067449 17459 0x5651706964f0 WARN nvinfer gstnvinfer.cpp:1188:gst_nvinfer_input_queue_loop:<secondary_gie_2> error: Failed to queue input batch for inferencing
ERROR: …/nvdsinfer/nvdsinfer_func_utils.cpp:31 [TRT]: FAILED_EXECUTION: std::exception
ERROR: nvdsinfer_backend.cpp:290 Failed to enqueue inference batch
ERROR: nvdsinfer_context_impl.cpp:1408 Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
ERROR from sink_sub_bin_encoder3: Failed to process frame.
Debug info: gstv4l2videoenc.c(1220): gst_v4l2_video_enc_handle_frame (): /GstPipeline:pipeline/GstBin:processing_bin_2/GstBin:sink_bin/GstBin:sink_sub_bin3/nvv4l2h264enc:sink_sub_bin_encoder3:
Maybe be due to not enough memory or failing driver
1:16:21.839169040 17459 0x5651706965e0 WARN nvinfer gstnvinfer.cpp:1188:gst_nvinfer_input_queue_loop:<secondary_gie_1> error: Failed to queue input batch for inferencing
ERROR from secondary_gie_2: Failed to queue input batch for inferencing
Debug info: gstnvinfer.cpp(1188): gst_nvinfer_input_queue_loop (): /GstPipeline:pipeline/GstBin:secondary_gie_bin/GstNvInfer:secondary_gie_2
ERROR from secondary_gie_1: Failed to queue input batch for inferencing
Debug info: gstnvinfer.cpp(1188): gst_nvinfer_input_queue_loop (): /GstPipeline:pipeline/GstBin:secondary_gie_bin/GstNvInfer:secondary_gie_1
Quitting
Destroying pipeline
ERROR from sink_sub_bin_queue3: Internal data stream error.
Debug info: gstqueue.c(988): gst_queue_handle_sink_event (): /GstPipeline:pipeline/GstBin:processing_bin_2/GstBin:sink_bin/GstBin:sink_sub_bin3/GstQueue:sink_sub_bin_queue3:
streaming stopped, reason error (-5)
Cuda failure: status=46 in CreateTextureObj at line 2513
nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (46) all CUDA-capable devices are busy or unavailable.
Cuda failure: status=46 in CreateTextureObj at line 2496
nvbufsurftransform.cpp(2369) : getLastCudaError() CUDA error : Recevied NvBufSurfTransformError_Execution_Error : (46) all CUDA-capable devices are busy or unavailable.
Segmentation fault (core dumped)

Another instance:

**PERF: FPS 0 (Avg) FPS 1 (Avg) FPS 2 (Avg) FPS 3 (Avg)
**PERF: 30.04 (30.02) 30.04 (30.02) 30.04 (30.02) 30.04 (30.02)
**PERF: 29.72 (30.02) 29.72 (30.02) 29.72 (30.02) 29.72 (30.02)
**PERF: 30.09 (30.02) 30.09 (30.02) 30.09 (30.02) 30.09 (30.02)
nvbufsurface: NvBufSurfaceSysToHWCopy: failed in mem copy
nvbufsurface: NvBufSurfaceCopy: failed to copy
nvbufsurface: NvBufSurfaceSysToHWCopy: failed in mem copy
nvbufsurface: NvBufSurfaceCopy: failed to copy
ERROR in BufSurfacecopy
nvbufsurface: NvBufSurfaceSysToHWCopy: failed in mem copy
nvbufsurface: NvBufSurfaceCopy: failed to copy
ERROR in BufSurfacecopy
ERROR from sink_sub_bin_encoder2: Failed to process frame.
Debug info: gstv4l2videoenc.c(1220): gst_v4l2_video_enc_handle_frame (): /GstPipeline:pipeline/GstBin:processing_bin_1/GstBin:sink_bin/GstBin:sink_sub_bin2/nvv4l2h264enc:sink_sub_bin_encoder2:
Maybe be due to not enough memory or failing driver
ERROR from sink_sub_bin_encoder1: Failed to process frame.
Debug info: gstv4l2videoenc.c(1220): gst_v4l2_video_enc_handle_frame (): /GstPipeline:pipeline/GstBin:processing_bin_0/GstBin:sink_bin/GstBin:sink_sub_bin1/nvv4l2h264enc:sink_sub_bin_encoder1:
Maybe be due to not enough memory or failing driver
Could not allocate cuda host buffer
Could not allocate cuda host buffer
Segmentation fault (core dumped)

Any idea what would be causing this?

Why use 1 SW-encoded sink? If you run without this SW encoding, is the issue still reproduced?

Hi

  1. The RTX 2080 Ti supports only 3 concurrent encoding sessions. If we enable the 4th sink to use the HW encoder, the application throws an error to that effect. So the other sinks (sink3-sink7) use the SW encoder type (see the sink config sketch after this list)
  2. The application doesn’t crash if only HW encoding is used (for sink0-sink2). However, all other sinks are disabled in this case (due to the above reason)
  3. We are using multiple parallel RTSP output streams (instead of a single tiled output)
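
For reference, the HW/SW encoder choice is made per sink group in the deepstream-app config. A hedged sketch of the split described above (assuming the enc-type key of the deepstream-app sink group; ports and bitrates are illustrative):

# sinks 0-2: HW encoder (nvv4l2h264enc), limited to 3 sessions on the RTX 2080 Ti
[sink0]
enable=1
type=4
codec=1
enc-type=0
bitrate=4000000
rtsp-port=8554
udp-port=5400
sync=0

# sink3 onwards: SW encoder (x264enc)
[sink3]
enable=1
type=4
codec=1
enc-type=1
bitrate=4000000
rtsp-port=8557
udp-port=5403
sync=1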

The detailed configuration and a related issue concerning RTSP output stream performance are being discussed here:
https://forums.developer.nvidia.com/t/deepstream-app-buffer-caching-observed-when-using-yolov3-with-multiple-rtsp-output-streams/140298/9

However, these two are likely to be independent and hence this issue is being explored separately.

Looking forward to inputs on this.
Thanks.

Is this issue reproduced with only the SW encoder?
If yes, can you share the pipeline?
I think this may be related to how your SW encoder accesses the raw buffer for encoding.

This is reproduced when the HW and SW encoders are used together.
When we run with only SW encoding, the application doesn’t crash (in about 2 hours of running, say).

We are using a slightly modified deepstream-app (with the latest patch from NVIDIA to fix a crash related to latency):
‘N’ RTSP Streams->Mux->nvinfer(YoloV3)->tracker->osd->demux-> ‘N’ RTSP streams.

YoloV3 is from the objectDetector_Yolo sample.
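
For reference, the FP16 setting lives in the nvinfer config from that sample; a sketch of the relevant keys (values taken from or assumed after config_infer_primary_yoloV3.txt, with the batch size matching the number of input streams, 8 here):

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
custom-network-config=yolov3.cfg
model-file=yolov3.weights
model-engine-file=model_b8_gpu0_fp16.engine
labelfile-path=labels.txt
# network-mode=2 selects FP16
network-mode=2
batch-size=8
num-detected-classes=80
parse-bbox-func-name=NvDsInferParseCustomYoloV3
engine-create-func-name=NvDsInferYoloCudaEngineGet
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so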

The config file is available in the previous message that I posted.

Machine configuration is:
CPU: Ryzen 9 3900X, (12 Cores, 24 Threads)
Memory: 64GB
GPU: ZOTAC GAMING GeForce RTX 2080 Ti Twin Fan 11GB GDDR6

From the log, it seems the pipeline had been running for a while.
And, from the above log, it may have failed due to running out of memory; did you monitor the GPU memory usage while it was running?
How many FPS can your SW encoder handle? And what is your target encoding FPS in your case?

Hi,

The throughput when no inference is enabled is 30 FPS for 8 x 720p RTSP streams. However, when inference is enabled, the throughput drops to 13 FPS with the same 8 x 720p @ 30 FPS RTSP input.

When the application starts:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:09:00.0 Off |                  N/A |
| 69%   79C    P2   178W / 250W |   1925MiB / 11018MiB |     35%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The above remains constant until the crash occurs.

There are a couple of error messages seen during the run:
(deepstream-app:12589): GStreamer-CRITICAL **: 20:15:37.081: gst_buffer_get_sizes_range: assertion ‘GST_IS_BUFFER (buffer)’ failed
**PERF: 12.98 (13.08) 12.98 (13.08) 12.98 (13.08) 12.98 (13.08) 12.98 (13.08) 12.98 (13.08) 12.98 (13.08) 12.98 (13.08)

Finally, when the crash occurs, the following is the memory status:
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     12589      C   …stream-app/deepstream-app       1793MiB |
+-----------------------------------------------------------------------------+
GPU 00000000:09:00.0: Detected Critical Xid Error
GPU 00000000:09:00.0: Detected Critical Xid Error
Fri Jul 10 20:28:39 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:09:00.0 Off |                  N/A |
| 69%   78C    P2   106W / 250W |   1459MiB / 11018MiB |     38%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The above is when using 3 HW encoders and 5 SW encoders.

I checked the config file; please try setting the batch-size to be the same as the number of input sources.
Can your SW encoder support encoding at 65 fps (5 x 13 fps)?
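
For reference, "batch-size equal to the number of input sources" means keeping the following keys in step with the source count; a minimal sketch assuming the 8-stream configuration discussed above:

[streammux]
batch-size=8

[primary-gie]
batch-size=8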

Hi,

Yes, the batch-size is the same as the number of input sources.
Yes, the CPU has the capacity to support 5 x 13 FPS (the CPU % used by deepstream-app is about 23%, and the load average is 2-5 on an AMD Ryzen 9 3900X CPU with 12 cores/24 threads, so there is enough capacity).

The throughput is 13 FPS when we have 3 HW and 5 SW encoders with the mux->nvinfer->tracker->demux pipeline running under deepstream-app.

Hi @mchi

I see that this is directly related to the driver. Should a bug report be filed?

Hi @deepak,
Sorry, do you mean this issue got fixed by updating the CUDA driver?

Thanks!

Hi @mchi

No. As this has NOT been fixed so far and no reason for the failure has been found, I was wondering if this should be reported as a bug to the development team.

Is it possible to share a repro with us?

Thanks!

Hi,

This is the standard deepstream-app running on:

CPU: Ryzen 9 3900X, (12 Cores, 24 Threads)
Memory: 64GB
GPU: ZOTAC GAMING GeForce RTX 2080 Ti Twin Fan 11GB GDDR6

You have already seen the configuration file.
Hope this helps.

https://drive.google.com/file/d/1dJvVo0FnELeZFEQFgl6LEDDP7ujDxUGH/view?usp=sharing

FYI: This will be available for a short period. If you are unable to access it, please send a private message/e-mail and we will upload it again and share it with you.

@deepak
Thanks for the repro. It’s helpful.
We will check and get back to you.

Sorry!
Could you share the repro steps for this package?

Hi,

You can go to the Yolo folder and run …/deepstream-app/deepstream-app -c YoloV3-8input-infer-sec123-face-analytics.txt (or any other config file that has 4 RTSP inputs).

You can either use RTSP cameras or simulate RTSP streams using:
cvlc sample_1080p_h264.mp4 ':sout=#gather:transcode{scodec=none}:rtp{sdp=rtsp://:9000/}' :no-sout-all :sout-keep :loop

If this doesn’t work for you, just use the standard deepstream-app provided by NVIDIA and use multiple input and output RTSP streams (as shown in the config file).

Hi @deepak,
Thanks for your repro; we can reproduce this issue now.
We will debug it and give you an update when we have progress.

Thanks!

Hello,

Now that DeepStream 5.0 is announced for General Availability, has this been fixed in the latest release?

Hey,
I am facing a similar error. Have you managed to find a fix for this?

Thanks.