Deepstream-app stopped working after 7 days

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): Jetson AGX Xavier
• DeepStream Version: 5.0
• JetPack Version (valid for Jetson only): 4.4
• TensorRT Version: TensorRT 7.1.3
• Issue Type (questions, new requirements, bugs): questions/bugs

Hi, I am running a deepstream-app pipeline with the config file attached here.
ds_app_config_4ch_yoloV3.txt (5.2 KB)

I was streaming four 10-hour video sources from an external hard drive and enabled ‘file-loop = 1’ at the bottom of the config so that playback starts over once the videos reach their end (the relevant looping section of the config is sketched right after the log below). The pipeline had been running normally for the past 7 days, but today it threw an exception and stopped as follows:

ERROR: Failed to synchronize on cuda copy-coplete-event, cuda err_no:6, err_str:cudaErrorLaunchTimeout
109:50:03.492845088 23957     0x16f69590 WARN                 nvinfer gstnvinfer.cpp:2012:gst_nvinfer_output_loop:<primary_gie> error: Failed to dequeue output from inferencing. NvDsInferContext error: NVDSINFER_CUDA_ERROR
ERROR from primary_gie: Failed to dequeue output from inferencing. NvDsInferContext error: NVDSINFER_CUDA_ERROR
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(2012): gst_nvinfer_output_loop (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInfer:primary_gie
109:50:03.494806368 23957     0x16f69590 WARN                 nvinfer gstnvinfer.cpp:616:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::releaseBatchOutput() <nvdsinfer_context_impl.cpp:1606> [UID = 1]: Tried to release an outputBatchID which is already with the context
Quitting
ERROR: Failed to synchronize on cuda copy-coplete-event, cuda err_no:6, err_str:cudaErrorLaunchTimeout
109:50:03.515479392 23957     0x16f69590 WARN                 nvinfer gstnvinfer.cpp:2012:gst_nvinfer_output_loop:<primary_gie> error: Failed to dequeue output from inferencing. NvDsInferContext error: NVDSINFER_CUDA_ERROR
109:50:03.515576128 23957     0x16f69590 WARN                 nvinfer gstnvinfer.cpp:616:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::releaseBatchOutput() <nvdsinfer_context_impl.cpp:1606> [UID = 1]: Tried to release an outputBatchID which is already with the context
109:50:03.524366176 23957     0x170889e0 ERROR                nvinfer gstnvinfer.cpp:1103:get_converted_buffer:<primary_gie> cudaMemset2DAsync failed with error cudaErrorLaunchTimeout while converting buffer
109:50:03.524570048 23957     0x170889e0 WARN                 nvinfer gstnvinfer.cpp:1363:gst_nvinfer_process_full_frame:<primary_gie> error: Buffer conversion failed
ERROR: [TRT]: ../rtSafe/safeContext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)
ERROR: [TRT]: FAILED_EXECUTION: std::exception
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
109:50:03.564645824 23957     0x16fb80f0 WARN                 nvinfer gstnvinfer.cpp:1216:gst_nvinfer_input_queue_loop:<primary_gie> error: Failed to queue input batch for inferencing
ERROR: Failed to make stream wait on event, cuda err_no:6, err_str:cudaErrorLaunchTimeout
ERROR: Preprocessor transform input data failed., nvinfer error:NVDSINFER_CUDA_ERROR
109:50:03.564916672 23957     0x16fb80f0 WARN                 nvinfer gstnvinfer.cpp:1216:gst_nvinfer_input_queue_loop:<primary_gie> error: Failed to queue input batch for inferencing
ERROR from primary_gie: Failed to dequeue output from inferencing. NvDsInferContext error: NVDSINFER_CUDA_ERROR
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(2012): gst_nvinfer_output_loop (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInfer:primary_gie
ERROR from primary_gie: Buffer conversion failed
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(1363): gst_nvinfer_process_full_frame (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInfer:primary_gie
ERROR from primary_gie: Failed to queue input batch for inferencing
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(1216): gst_nvinfer_input_queue_loop (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInfer:primary_gie
ERROR from primary_gie: Failed to queue input batch for inferencing
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(1216): gst_nvinfer_input_queue_loop (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInfer:primary_gie
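
For reference, the looping is enabled in the [tests] section at the very bottom of the attached config. The relevant parts look roughly like this (a trimmed sketch, not the full file; the source URI is a placeholder for the actual paths on my external drive):

[source0]
enable=1
# type=3 selects a URI source (a local file in this case)
type=3
uri=file:///media/external_hdd/video0.mp4
num-sources=1

[tests]
# loop the file sources when they reach EOS
file-loop=1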

I am also attaching the Nvidia bug report log here:
nvidia-bug-report-tegra.log (322.9 KB)

It looks like the errors are coming from inside CUDA. Would you mind telling me what is going on here and how I can fix it? Thank you!

Could you share the log (nvidia-bug-report-tegra.log) captured with the command below when the issue is reproduced?

$ sudo nvidia-bug-report-tegra.sh

Hi, I have already attached the nvidia-bug-report-tegra.log in my post above, right below the error message.

Did you run the commands below before the test?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

I didn’t do it. By ‘test’, do you mean running the command to generate nvidia-bug-report-tegra.log? Do I need to do that first?

No, the two commands have nothing to do with generating nvidia-bug-report-tegra.log; they are used to boost and lock the CPU, GPU, and EMC clocks to their maximum.
Without them, DVFS remains active on the CPU, GPU, and EMC, and under a stress test that can lead to the issue above.
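
If it helps, you can check the current state after running them (both tools ship with JetPack; the exact output format may differ between releases):

$ sudo nvpmodel -q          # prints the active power mode (should report MAXN / mode 0)
$ sudo jetson_clocks --show # prints the current CPU/GPU/EMC clock settings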

I didn’t run these commands before deepstream-app, but I know it was running in MAXN mode. Is that related?

sudo nvpmodel -m 0 ==> this sets MAXN, which sets the CPU/GPU/EMC clock limits to their maximum
sudo jetson_clocks ==> this locks the CPU/GPU/EMC to their max clocks; it also disables DVFS

Disabling DVFS may help with this issue.
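
A simple way to apply this consistently is to run both commands right before launching the pipeline, for example (using the config file you attached above):

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ deepstream-app -c ds_app_config_4ch_yoloV3.txt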
