Lately we have been experiencing crashes/shutdowns when building TensorRT engines with trtexec in MAXN mode on JetPack 4.6.1. During the build, the Xavier suddenly shuts down with no notable log entries (kern.log, syslog and trtexec verbose log attached; the shutdown happened at about 14:07:06).
We have to power it on again after that by pressing the power button.
The problems only occur in MAXN mode.
We had similar problems when running PyTorch scripts on JetPack 4.6; upgrading to 4.6.1 solved them.
(Jetson AGX Xavier MAXN Mode crashes - #18 by vovea)
kern.log (104.0 KB)
syslog.txt (43.4 KB)
trtexec.txt (2.6 KB)
Is it possible to test on the latest release? Also, could you share a sample model to reproduce the issue?
Do you mean R34.1.1?
I got the opportunity today to test the engine build on an AGX Orin (R34.1.1), which went well - but I am not sure if the comparison between the AGX Xavier and AGX Orin makes a lot of sense.
Since we only have one Xavier device, and it runs software from multiple people, it usually takes us longer for version upgrades.
Here is the ONNX model whose engine build crashes the AGX Xavier: ONNX model
trtexec --onnx=yolov4_1_3_640_640_static.onnx --saveEngine=yolov4_1_3_640_640_static.engine
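Since the device goes down without leaving anything useful in the logs, it can help to know how far the build got before power was lost. Below is a minimal sketch of a wrapper that runs the command above and writes every output line to disk with a timestamp, syncing after each line. The function name `run_logged`, the constant `TRTEXEC_CMD` and the log file name are made up for this example; `--verbose` is added to the original command to get the same detail as the attached trtexec log.

```python
import datetime
import os
import subprocess

# Same model and engine file names as in the command above, plus --verbose.
TRTEXEC_CMD = [
    "trtexec",
    "--onnx=yolov4_1_3_640_640_static.onnx",
    "--saveEngine=yolov4_1_3_640_640_static.engine",
    "--verbose",
]


def run_logged(cmd, log_path):
    """Run cmd and append each output line to log_path with a timestamp.

    Each line is flushed and fsync'ed before the next one is read, so the
    last line in the file is close to the last thing the process printed
    before a sudden power loss.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # merge stderr into the same stream
            universal_newlines=True,    # text mode, also works on Python 3.6
        )
        for line in proc.stdout:
            stamp = datetime.datetime.now().isoformat(timespec="milliseconds")
            log.write("{} {}".format(stamp, line))
            log.flush()
            os.fsync(log.fileno())
        return proc.wait()
```

Calling `run_logged(TRTEXEC_CMD, "trtexec_build.log")` on the device starts the build. After a crash, the last timestamp in the file can be compared with the last entries in kern.log. The per-line fsync slows the logging down a bit, which should not matter for an engine build.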
Here is the kern.log from today’s AGX Xavier crash. It occurred at ~13:48:37, during the TensorRT engine build. It seems that the device rebooted itself.
kern.log (102.7 KB)
I don’t see any issue with the latest release (JetPack 5.0.1) on Jetson Xavier. Could you consider upgrading to the latest release to avoid this issue and to get the newer TensorRT features?
Thanks for the info. Were you able to reproduce the issue with JetPack 4.6.1? I ask because we see similar problems on our Jetson Nano, for which no JetPack 5 is available.
I updated our Xavier to JetPack 5.0.1 yesterday, and the issue is still there when I build the engine from the ONNX model mentioned above.
Here is the kern.log
kern.log (730.1 KB)
The last entry I can see is
Jun 30 22:52:54 xavier kernel: [ 8320.783534] NVRM: No NVIDIA GPU found.
Do you still have the issue? Are you able to run any CUDA sample? Could you share the output of the CUDA Device Query sample?
It turned out that the problem was a shunt (0.1 Ohm) for power measurements that we had put into the voltage supply line to the carrier board. However, it is not clear to us how this shunt can cause the issue.
Is it possible that the voltage drop across the shunt triggers the dV/dt circuit, which sets the VDDIN_PWR_BAD signal and initiates a shutdown? If so, shouldn't this be a clean shutdown that shows up in the log file?
(I can also open a new topic on this)
Thanks in advance.
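To get a feel for the magnitudes in the shunt question, here is a rough sketch of the steady-state voltage drop across a 0.1 Ohm shunt. Only the 0.1 Ohm value comes from the thread. The 19 V supply and the load powers in the usage note are assumptions for illustration, not measurements, and the board is modelled as a simple constant-power load. The sketch says nothing about transients (fast load steps, wiring inductance), so it cannot confirm or rule out the VDDIN_PWR_BAD hypothesis.

```python
import math


def shunt_drop(current_a, shunt_ohm=0.1):
    """Voltage across the shunt by Ohm's law: V = I * R."""
    return current_a * shunt_ohm


def board_voltage(supply_v, load_w, shunt_ohm=0.1):
    """Voltage at the carrier board input for a constant-power load.

    With V_board = V_supply - I * R and P = V_board * I, the board
    voltage solves V_board**2 - V_supply * V_board + P * R = 0.
    The larger root is the physically relevant operating point.
    """
    disc = supply_v ** 2 - 4.0 * load_w * shunt_ohm
    if disc < 0:
        raise ValueError("supply cannot deliver this power through the shunt")
    return (supply_v + math.sqrt(disc)) / 2.0
```

For example, `19.0 - board_voltage(19.0, 30.0)` gives the static drop for an assumed 19 V supply and an assumed 30 W load. Sweeping `load_w` over a few values shows how the drop grows as the board draws more power, which is the regime MAXN mode pushes towards.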
Thanks for the update. Please file a new topic if this issue can be reproduced.