hi
I am using a Jetson AGX Orin and running one model on each of DLA0 and DLA1.
Each model runs in its own container, and the container is based on the deepstream-test5 sample.
While the models were running, the GPU went offline and could no longer allocate memory. My current workaround is to reboot the machine. Here is what dmesg prints
I started a model yesterday afternoon, and it was still producing results at 21:07 in the evening, but there was no data after that. Because the model does area-intrusion detection and it was night, I cannot tell exactly when the problem occurred. This is my log, please help analyze it: log.txt (149.1 KB)
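One way to narrow down the failure time even when nothing is detected at night is to log a heartbeat timestamp for every processed batch and then look for long silences. The following is only a sketch under that assumption: the heartbeat logging itself and the names `find_gaps` and `max_gap` are hypothetical and not part of deepstream-test5.

```python
from datetime import datetime, timedelta


def find_gaps(heartbeats, max_gap=timedelta(seconds=60)):
    """Return (last_seen, next_seen) pairs where the pipeline was silent
    for longer than max_gap.

    heartbeats: iterable of datetime objects, one per processed batch
    (hypothetical; the application would have to log these itself).
    """
    ordered = sorted(heartbeats)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


if __name__ == "__main__":
    beats = [
        datetime(2024, 1, 1, 21, 5),
        datetime(2024, 1, 1, 21, 6),
        datetime(2024, 1, 1, 21, 7),
        datetime(2024, 1, 2, 8, 0),
    ]
    for last_seen, next_seen in find_gaps(beats):
        print(f"silent from {last_seen} until {next_seen}")
```

With a heartbeat like this, the start of the first gap gives an upper bound on when the pipeline stopped, independent of whether any intrusion was detected.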
hi
This morning we found that the GPU had stopped once. Although it recovered on its own without any intervention, the detection results were wrong: the confidence is -0.1 and the coordinates are 0. As I understand it, this means no target was identified. After I restarted the container, the program worked again.
I think there are two cases: 1) while the GPU is not working, restarting the container does not help; 2) once the GPU is back to normal, restarting the container makes the program work properly again.
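The two cases above could be turned into a small watchdog decision. This is a minimal sketch, not an existing DeepStream feature: the function names are hypothetical, `gpu_ok` is assumed to come from some external health check that is not shown, and the "placeholder" test simply encodes the symptom described here (negative confidence with all coordinates equal to 0).

```python
def is_placeholder_detection(confidence, left, top, width, height):
    """True for the broken results seen after the GPU recovered:
    negative confidence (e.g. -0.1) and an all-zero bounding box."""
    return confidence < 0 and left == 0 and top == 0 and width == 0 and height == 0


def choose_recovery(gpu_ok, recent_detections):
    """Pick a recovery action.

    gpu_ok: result of an external GPU health check (assumption, not shown).
    recent_detections: list of (confidence, left, top, width, height) tuples.
    """
    if not gpu_ok:
        # Case 1: restarting the container does not help while the GPU is down.
        return "reboot"
    if recent_detections and all(
        is_placeholder_detection(*det) for det in recent_detections
    ):
        # Case 2: GPU is back, but the pipeline only emits placeholder results.
        return "restart_container"
    return "none"


if __name__ == "__main__":
    print(choose_recovery(True, [(-0.1, 0, 0, 0, 0)]))
```

Keeping the decision separate from the health check makes it easy to swap in whatever GPU check turns out to be reliable on the device.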
I followed your suggestion and tested with a relatively stable RTSP stream: dmesg.txt (147.2 KB)
Is it possible to reproduce the issue with JPS AI NVR or the DeepStream samples? We can use nvstreamer to simulate an RTSP camera. Debugging is more efficient if you can reproduce the issue with an NVIDIA release or share your application with us.
I tested with a stable video source for two days and did not find any problems with the GPU or DLA. That is, the machine is fine when the video source is relatively stable.
But just now, at 12:01, I restarted the stream-pushing application by mistake, the DeepStream container lost its video source, and the problem occurred again. A software error caused a hardware failure, so I think this is a bug. Here is my log.
I think you should test the case where the video source is lost when you try to reproduce the problem. If you are using nvstreamer as the RTSP server, you can simply kill the nvstreamer container.
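The source-loss test could be scripted roughly as follows. This is a sketch that assumes the RTSP server runs in a Docker container named `nvstreamer` and that the DeepStream pipeline is already running; the helper names and the `settle_seconds` parameter are hypothetical. It only issues `docker kill` and does not inspect the GPU state afterwards.

```python
import subprocess
import time


def kill_command(container_name):
    """Build the docker CLI command that kills the given container."""
    return ["docker", "kill", container_name]


def simulate_source_loss(container_name="nvstreamer", settle_seconds=0,
                         run=subprocess.run):
    """Kill the RTSP server container to make the pipeline lose its source.

    run is injectable so the function can be exercised without Docker.
    Returns True if the kill command exited with status 0.
    """
    result = run(kill_command(container_name), capture_output=True, text=True)
    # Give the pipeline some time to notice the loss before checking dmesg.
    time.sleep(settle_seconds)
    return result.returncode == 0


if __name__ == "__main__":
    ok = simulate_source_loss("nvstreamer", settle_seconds=30)
    print("source killed" if ok else "docker kill failed")
```

After the kill, the kernel log can be checked by hand with dmesg to see whether the GPU went offline.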
I found the log below in your kernel log. However, I tried “docker kill nvstreamer” on my side and could not reproduce the issue. After “docker kill nvstreamer”, I added one RTSP camera to VST, and DeepStream works fine.
At present I have only one machine, and its JPS environment was lost earlier because of the jetson-storage problem. The machine is now set up as a development environment, and its Docker configuration points to a specific repository, so I cannot test against JPS for now.
Have you found any scenario that reproduces the issue more easily? We can involve more engineers to check the issue if we can reproduce it easily on our side.
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks