I have been experiencing a recurring issue with my Jetson AGX Orin recently, but it’s not 100% clear to me what the cause is. We have about 8-10 systems running essentially the same setup and configuration, but we have not seen this issue on any of the others so far.
From a little bit of testing, it looks like it is related to the startup of one of our inference systems (models being loaded and started by the Stereolabs ZED SDK) and/or the pair of Stereolabs ZED2 cameras. The cameras frequently fail to register (they register initially but then disconnect afterward), and if they do register, we get the GPU error report. I have attached a copy of one of the recent dmesg logs.
Does anyone have any insight into this issue? Does anything in the logs stand out as a possible cause? Any advice or assistance is greatly appreciated!
weird_gpu_failure.log (93.5 KB)