The GPU stops working while DLA is running, and restarting the container does not restore memory allocation

Hi,
I am using a Jetson AGX Orin and running one algorithm model each on DLA0 and DLA1.
Each model is started as a container based on the deepstream-test5 example.
While the models were running, the GPU went offline and could not allocate memory. The current workaround is to restart the machine. Here is the output of dmesg:

dmesg.txt (149.7 KB)

I started an algorithm model yesterday afternoon, and inference results were still being produced at 21:07 in the evening, but no data appeared after that. Because this is an area-intrusion use case and it was at night, there is no way to determine exactly when the problem occurred. Here is my log; please help analyze it.
log.txt (149.1 KB)

Can you share the deepstream-test5 container log? Do you see any error log when the GPU stops working?

Please pay attention to the log after 18:00 on August 5th.
deepstream-crowd-detect-dla1-v100.log (4.9 MB)

The FPS drops to 0 after the log below. It looks like an RTSP connection error. Do you see the log below every time you need to reboot the Orin device?

10:50:30.820953811    19   0xaaab341d9aa0   WARN   nvinfer gstnvinfer.cpp:2420:gst_nvinfer_output_loop:<primary_gie> error: Internal data stream error.
10:50:30.820982131    19   0xaaab341d9aa0   WARN   nvinfer gstnvinfer.cpp:2420:gst_nvinfer_output_loop:<primary_gie> error: streaming stopped, reason error (-5)
ERROR from tracking_tracker: Failed to submit input to tracker
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvtracker2/gstnvtracker.cpp(792): gst_nv_tracker_submit_input_buffer (): /GstPipeline:pipeline/GstBin:tracking_bin/GstNvTracker:tracking_tracker
ERROR from primary_gie: Internal data stream error.
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(2420): gst_nvinfer_output_loop (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInfer:primary_gie:
streaming stopped, reason error (-5)
ERROR from tracking_tracker: Failed to submit input to tracker
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvtracker2/gstnvtracker.cpp(792): gst_nv_tracker_submit_input_buffer (): /GstPipeline:pipeline/GstBin:tracking_bin/GstNvTracker:tracking_tracker
gstnvtracker: All sub-batches are fully allocated. Modify “sub-batches” configuraion to accommodate more number of streams
gstnvtracker: Batch 913653 already active!
nvmultiurisrcbin ERROR from element udpsrc85: Internal data stream error.
ERROR from udpsrc85: Internal data stream error.
Debug info: …/libs/gst/base/gstbasesrc.c(3127): gst_base_src_loop (): /GstPipeline:pipeline/GstBin:multiuri_src_bin/GstDsNvMultiUriBin:src_nvmultiurisrcbin/GstBin:src_nvmultiurisrcbin_creator/GstDsNvUriSrcBin:dsnvurisrcbin16/GstRTSPSrc:src/GstUDPSrc:udpsrc85:
streaming stopped, reason error (-5)
nvmultiurisrcbin ERROR from element udpsrc79: Internal data stream error.
ERROR from udpsrc79: Internal data stream error.
Debug info: …/libs/gst/base/gstbasesrc.c(3127): gst_base_src_loop (): /GstPipeline:pipeline/GstBin:multiuri_src_bin/GstDsNvMultiUriBin:src_nvmultiurisrcbin/GstBin:src_nvmultiurisrcbin_creator/GstDsNvUriSrcBin:dsnvurisrcbin17/GstRTSPSrc:src/GstUDPSrc:udpsrc79:
streaming stopped, reason error (-5)
Active sources : 1
Mon Aug 5 19:10:07 2024
**PERF:
303239333436003d209de75804910000 22.28 (13.77) 303239333436003d209d8214ded00000 0.00 (0.00)
Active sources : 0
Mon Aug 5 19:10:12 2024

I compared the two logs, and both showed this, but I'm not sure when the problem occurred.

I think the error above is an RTSP connection error; it should recover if the RTSP source reconnects.
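One way to make the pipeline survive a dropped camera is to let the source bin reconnect on its own. Below is a minimal sketch, assuming the sources are built on DeepStream's nvurisrcbin (as the nvmultiurisrcbin log above suggests); the property name rtsp-reconnect-interval is taken from DeepStream 6.x and should be verified on your release, and the camera URI is a placeholder.

```python
#!/usr/bin/env python3
# Sketch only: enable automatic RTSP reconnection on DeepStream's nvurisrcbin
# so a dropped stream is retried instead of the pipeline stopping with
# "streaming stopped, reason error (-5)".
# Assumption: the "rtsp-reconnect-interval" property name is from DeepStream
# 6.x and may differ on other releases; the camera URI is a placeholder.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

src = Gst.ElementFactory.make("nvurisrcbin", "camera-0")
if src is None:
    raise RuntimeError("nvurisrcbin not found; is DeepStream installed?")

src.set_property("uri", "rtsp://192.168.1.10:554/stream1")   # placeholder URI

# Only set the reconnect interval if this DeepStream build exposes it.
if src.find_property("rtsp-reconnect-interval") is not None:
    src.set_property("rtsp-reconnect-interval", 10)          # retry every 10 s

# Print all properties this build exposes, to verify the key name above.
for prop in src.list_properties():
    print(prop.name)
```

In the config-driven deepstream-test5 app the equivalent behavior is normally enabled through the RTSP reconnect setting of each source group in the configuration file; the exact key name depends on the DeepStream release, so check the deepstream-app source-group documentation for your version.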

Can you run more tests to capture the kernel log and the DeepStream log at the moment the error happens that forces you to reboot the device?
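A minimal capture sketch that could run alongside the test: it tees dmesg and the container log into timestamped files so the failure window is recorded even if it happens overnight. The container name deepstream-test5 is an assumption; substitute the real one.

```python
#!/usr/bin/env python3
# Sketch only: record the kernel log and the DeepStream container log into
# timestamped files so the failure window is captured even if it happens
# overnight. Run as root (dmesg --follow usually requires it).
# Assumption: the container is named "deepstream-test5"; use the real name.
import subprocess
import time

CONTAINER = "deepstream-test5"
stamp = time.strftime("%Y%m%d-%H%M%S")

with open(f"dmesg-{stamp}.log", "w") as kern, \
     open(f"{CONTAINER}-{stamp}.log", "w") as app:
    procs = [
        # Follow the kernel ring buffer with human-readable timestamps.
        subprocess.Popen(["dmesg", "--follow", "--ctime"],
                         stdout=kern, stderr=subprocess.STDOUT),
        # Follow the container's stdout/stderr with timestamps.
        subprocess.Popen(["docker", "logs", "-f", "--timestamps", CONTAINER],
                         stdout=app, stderr=subprocess.STDOUT),
    ]
    try:
        for p in procs:
            p.wait()
    except KeyboardInterrupt:
        for p in procs:
            p.terminate()
```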

Hi,
This morning we found that the GPU had stopped once. Even though the GPU came back to work on its own without any intervention, the detection results were wrong: the confidence was -0.1 and the coordinates were 0. As I understand it, no targets were being detected, so I restarted the container and the program worked again.
I think there are two cases: 1) while the GPU is not working, restarting the container does not help; 2) once the GPU is back to normal, restarting the container restores correct operation.
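For the second case (GPU back to normal but the pipeline stuck at 0 fps or producing empty detections), a simple external watchdog could restart the container automatically. The sketch below is an assumption, not part of deepstream-test5: it tails the container log, looks for the **PERF lines shown earlier, and restarts the container after several consecutive samples where every stream reports 0.00 fps.

```python
#!/usr/bin/env python3
# Watchdog sketch (not part of deepstream-test5): follow the container log,
# and if the **PERF lines report 0.00 fps on every stream for several
# consecutive samples, restart the container with "docker restart".
# Assumptions: container name, threshold, and the "22.28 (13.77)" fps format
# seen in the log excerpt above.
import re
import subprocess

CONTAINER = "deepstream-test5"
MAX_STALLED_SAMPLES = 6                               # consecutive 0-fps PERF samples
FPS_RE = re.compile(r"(\d+\.\d+)\s+\(\d+\.\d+\)")     # matches "22.28 (13.77)"

def main():
    stalled = 0
    logs = subprocess.Popen(
        ["docker", "logs", "-f", "--tail", "0", CONTAINER],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in logs.stdout:
        fps = [float(v) for v in FPS_RE.findall(line)]
        if not fps:
            continue                                  # not a PERF sample line
        stalled = stalled + 1 if max(fps) == 0.0 else 0
        if stalled >= MAX_STALLED_SAMPLES:
            subprocess.run(["docker", "restart", CONTAINER], check=False)
            stalled = 0

if __name__ == "__main__":
    main()
```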
I have followed your suggestion and tested with a relatively stable RTSP stream.
dmesg.txt (147.2 KB)

Is it possible to reproduce the issue with JPS AI NVR or the DeepStream samples? We can use nvstreamer to simulate an RTSP camera. It is more efficient to debug if you can reproduce the issue with an NV release or share your application with us for debugging.

I used a stable video source for two days of testing and did not find any problems with the GPU or DLA. In other words, the machine is fine when the video source is relatively stable.
But just now, at 12:01, I mistakenly restarted the streaming push application, the DeepStream container lost its video source, and the problem reoccurred. A software error causing a hardware failure looks like a bug to me; here are my logs.
I think you should try the video-source-loss case when reproducing the problem. If you are using nvstreamer as the RTSP server, you can just kill the nvstreamer container.
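To exercise the video-source-loss path repeatedly, a small harness could toggle the RTSP source container off and on while the DeepStream containers keep running. The sketch below assumes the source runs in a container literally named nvstreamer and that plain docker stop/start is enough to simulate the loss; adjust names and timings as needed.

```python
#!/usr/bin/env python3
# Reproduction sketch: toggle the RTSP source container off and on in a loop
# to exercise the video-source-loss path while the DeepStream containers
# keep running. Assumption: the source container is literally named
# "nvstreamer"; adjust names and timings to your setup.
import subprocess
import time

RTSP_CONTAINER = "nvstreamer"
DOWN_SECONDS = 120        # how long the source stays offline each cycle
UP_SECONDS = 300          # how long it streams before the next drop

while True:
    subprocess.run(["docker", "stop", RTSP_CONTAINER], check=False)
    time.sleep(DOWN_SECONDS)
    subprocess.run(["docker", "start", RTSP_CONTAINER], check=False)
    time.sleep(UP_SECONDS)
```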

deempstream.txt (11.6 KB)
dmesg.txt (147.7 KB)

I found the log below in your kernel log. However, I tried “docker kill nvstreamer” on my side and could not reproduce the issue. After “docker kill nvstreamer” I added one RTSP camera back into VST, and DeepStream worked fine.

[Fri Aug 9 13:39:24 2024] tegra-vic 15340000.vic: deepstream-test: job submission failed: host1x job submission failed: -4

I am running two algorithm models, each with two input video streams, and not every disconnection reproduces the issue.

Can you try the JPS release without your models? It is more efficient to debug the issue if I can reproduce it on my side.

At present I only have one machine, and its JPS environment was lost earlier due to a jetson-storage problem. The machine is now set up as a development environment, and the Docker configuration also points to a specific repository, so I cannot test against JPS for now.

Have you found any scenario that reproduces the issue more easily? We can involve more engineers to check the issue if we can reproduce it easily on our side.

Can you share the brand of the unstable camera? I want to check whether it is on our support list: VST — Metropolis on Jetson documentation 0.1.0

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.