Jetson AGX Xavier self rebooting

@ynjiun: That would tend to imply 100°C is a trip temperature rather than an actual temperature, so you are correct about that. I went and examined a couple of Jetsons, and they all had that behavior. Some of the temperature monitoring only reports the trip point rather than an actual measurement, and this is apparently one of those cases.
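
As a rough illustration only (zone numbering and the exact sysfs layout vary between L4T releases), the standard Linux thermal sysfs interface exposes both the measured temperature and the trip points, so the two values can be compared directly:

#include <stdio.h>

/* Read one millidegree-Celsius value from a thermal sysfs file. */
static long read_millideg(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;

    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    /* Zone 0 is only an example; Xavier exposes several zones (CPU, GPU, AUX, ...). */
    long temp = read_millideg("/sys/class/thermal/thermal_zone0/temp");
    long trip = read_millideg("/sys/class/thermal/thermal_zone0/trip_point_0_temp");

    printf("measured: %.1f C   trip point 0: %.1f C\n", temp / 1000.0, trip / 1000.0);
    /* A tool that only reads the trip file will sit at a constant value
       (e.g. 100 C) no matter what the die is actually doing. */
    return 0;
}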

@simon.glet, that appears to be the same error. What kind of hardware was the video from? I could see any virtual desktop making custom adjustments to networking and triggering something which does not commonly occur (a corner case). This particular case also shows (as you mentioned) some GPU involvement higher up in the stack frame, and then below it the same network problems. If you have a URL to the video or more information, it would help.

What makes this more recent stack frame interesting is that GPU calls were made after network calls, which would make sense if network data is driving GPU activity. In the previously posted cases the GPU activity was not necessarily present in the stack frame. There is a strong chance that the GPU is just another way the bug shows up, and not the original cause. A network error should be correctable, yet it seems to cause rebooting; perhaps the GPU driver is also not handling the error condition which has been passed to it.

The first function call which starts something “specific” in the failure is this:

Sep 7 17:11:16 simon-desktop kernel: [18133.445329] [] net_rx_action+0xf4/0x358

…the GPU has not even been involved yet at that point in the stack frame. After some network activity there is another IRQ, and timers start failing. The GPU errors are part of normal logging and not part of the stack frame, but the GPU error is apparently occurring while the stack frame is being dumped:

Sep 7 17:10:02 simon-desktop kernel: [18059.497433] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:10:02 simon-desktop kernel: [18059.497640] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505

I am inclined to believe that the GPU error message is just a side effect of network code gone wrong. The error is that the GPU needs to acquire a semaphore but cannot. This is out of the GPU's control and is the result of something else blocking it. It is a bit like driving up to a gas station to refill the car, but there is a line of hundreds of cars in front, and one of them has a dead engine…nobody behind that car could reach the gas even if some is available.
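
This is not the nvgpu code itself, just a minimal user-space sketch of that failure mode: one thread holds a semaphore and never releases it (the "dead engine"), so a second thread's acquire, although perfectly valid, can only time out:

/* Build with: gcc -pthread sem_demo.c */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static sem_t sem;

/* The car with the dead engine: grabs the resource and never lets go. */
static void *stuck_holder(void *arg)
{
    (void)arg;
    sem_wait(&sem);
    for (;;)
        sleep(1);        /* never calls sem_post() */
}

int main(void)
{
    pthread_t t;
    struct timespec deadline;

    sem_init(&sem, 0, 1);
    pthread_create(&t, NULL, stuck_holder, NULL);
    sleep(1);            /* give the holder time to take the semaphore */

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 3;   /* be patient for 3 seconds, then give up */

    /* Everyone queued behind the dead car: the acquire itself is fine,
       but it cannot succeed until the holder releases. */
    if (sem_timedwait(&sem, &deadline) != 0)
        printf("semaphore acquire timeout!\n");
    return 0;
}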

If you can provide a way to replicate this, then someone from NVIDIA could probably go straight into the stack frame and find the specific network condition which is stalling out. This issue is part of networking, but it is interfering with the GPU when those virtual desktops are involved.

@ynjiun and @simon.glet: This is a good idea (perhaps both of you could apply this patch and try again):

…I think you’ve just found one of the triggers of the same network issue, and if that patch worked for the other soft lockup, then it will very likely work for the virtual desktop network issues as well.

FYI, in theory, if the soft lockup is simply the result of too high a load, then running max performance could help, but only to an extent. If a software bug is causing the soft lockup, then there is no possibility of performance modes helping. Either way, the real solution is to stop the soft lockup (and it looks like the patch above is most likely the fix).
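
For context (a generic Linux sketch, not Jetson-specific), the soft lockup detector only fires after a CPU has monopolized the kernel for a configurable number of seconds; the threshold lives in /proc/sys/kernel/watchdog_thresh, and the warning is normally raised at roughly twice that value. On Jetson, "max performance" would typically mean nvpmodel -m 0 followed by jetson_clocks, which shortens how long any single piece of work holds the CPU:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/watchdog_thresh", "r");
    int thresh = 0;

    if (f && fscanf(f, "%d", &thresh) == 1) {
        /* The soft lockup message appears when a CPU has been stuck in
           kernel code for roughly twice this many seconds without
           giving other tasks a chance to run. */
        printf("watchdog_thresh = %d s (soft lockup after ~%d s)\n",
               thresh, 2 * thresh);
    }
    if (f)
        fclose(f);
    return 0;
}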

I do not think that GPU temperature is the cause. Keep in mind that if the system is running in a lower performance mode, the timers which determine whether or not there is a soft lockup can also start later…if there is data which must be sent to the GPU, that transfer is already under way before the GPU ever tries to use it. Running in a lower performance mode could actually give the data more time to move through the system before the soft lockup timer starts. I think the earlier-mentioned patch is on target:
https://forums.developer.nvidia.com/t/xavier-with-jp4-2-hangs/72014/8