Jetson AGX Xavier MAXN Mode crashes

Recently we experienced sudden shutdowns of our Xavier while running a script that performs PyTorch inference.
We see no unusual entries in dmesg, kern.log, or syslog. We also tried running dmesg over the serial connection.
The Xavier is running with jetson_clocks enabled in MAXN mode.

When the power mode is set to anything other than MAXN, the script runs to completion.
We are running JetPack 4.6 [L4T 32.6.1].
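
For anyone reproducing this, the power mode and clock state can be checked and switched from a shell. This is a minimal sketch, assuming the stock AGX Xavier nvpmodel configuration in which mode ID 0 is MAXN. The function names are only illustrative, and the IDs of the other modes should be taken from the output of nvpmodel -q or from /etc/nvpmodel.conf on your board.

```shell
# Sketch: inspect and change the Jetson power mode.
# Assumption: stock AGX Xavier nvpmodel configuration, where mode 0 is MAXN.

show_power_state() {
    sudo nvpmodel -q            # print the active power mode name and ID
    sudo jetson_clocks --show   # print the current CPU/GPU/EMC clock settings
}

set_power_mode() {
    mode_id="$1"
    # Reject empty or non-numeric IDs before touching the system.
    case "$mode_id" in
        ''|*[!0-9]*)
            echo "usage: set_power_mode <numeric mode id>" >&2
            return 2
            ;;
    esac
    sudo nvpmodel -m "$mode_id"
}
```

Calling show_power_state before starting the script confirms which mode is active; set_power_mode 0 selects MAXN under the assumption above.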

Any ideas or recommendations?
Thanks

Hi,

Actually, you don’t need to run “dmesg” in the serial console; the UART console should automatically print some log output if something really crashes.

You can share the log first so that we can tell what happened.

Hi Wayne,

Thanks for reaching out!

I logged a session covering boot → starting the script → booting again after the shutdown.
I started the script at around 11:04, and the shutdown occurred a few seconds later. Around 11:06 I booted the Xavier again.
I attached the serial console log, syslog, and kern.log.

Just to clarify: we run the script in a Docker container, but the issue also appears when we set up the environment with venv.

Xavier-AGX_COM8USBSerialPortCOM8_20220223_110450_withReboot.txt (79.3 KB)
kern.log (194.3 KB)
syslog (392.4 KB)
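
To narrow the attached syslog and kern.log down to the interval around the shutdown, something like the following could be used. It is only a sketch: log_window is a hypothetical helper name, and it assumes the traditional syslog timestamp layout ("Mon DD HH:MM:SS host ..."), so that the third whitespace-separated field is the time of day. The date itself is not checked.

```shell
# Print syslog/kern.log lines whose time of day falls within [start, end].
# Arguments: start time, end time (both HH:MM:SS), log file.
# Assumes the third whitespace-separated field is the HH:MM:SS timestamp;
# times are compared as strings, which works for zero-padded HH:MM:SS.
log_window() {
    awk -v start="$1" -v end="$2" '$3 >= start && $3 <= end' "$3"
}
```

For the session described above, log_window 11:04:00 11:06:59 syslog would print only the entries between starting the script and the next boot.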

Hi,

So the board just reboots without any warning, with only the log below? Could you share what kind of application is running?

   * VPI: ii libnvvpi1 1.1.15 arm64 NVIDIA Vision Programming Interface library
   * Vulkan: 1.2.70
------------------------------------------------------------
Mounted on           Size Avail Use%
/media/cdleml/128GB  117G  807M 100%
/                     28G   15G  45%
------------------------------------------------------------

cdleml@xavier:~$ docker start -i pysot
root@13e5c5fc5b4d:/# cd /home/pysot/experiments/siamrpn_mobilev2_l234_dwxcorr/
 ../../tools/test.py --snapshot model.pth --config config.yaml --dataset VOT2018
loading VOT2018: 100%|##################################| 60/60 [00:02<00:00, 29.01it/s, zebrafish1]
[0000.425] W> RATCHET: MB1 binary ratchet value 4 is too large than ratchet level 2 from HW fuses.
[0000.434] I> MB1 (prd-version: 1.5.1.7-t194-41334769-98030a79)
[0000.439] I> Boot-mode: Coldboot
[0000.442] I> Chip revision : A02
[0000.445] I> Bootrom patch version : 15 (correctly patched)
[0000.450] I> ATE fuse revision : 0x200

Hi,

So after starting the Python script, the fan stops spinning and the LED goes off. There are no further messages in the logs, and the Xavier stays off. The boot messages in the log file are there because I pressed the power button on the board after about 2 minutes.
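
Since the logs stay silent, one option for gathering more data is to record the temperatures right up to the power-off. The following is a rough sketch, not something we have verified as the cause: read_thermal_zones is a hypothetical helper that reads the standard Linux thermal sysfs entries, which report millidegrees Celsius. The zone names depend on the board.

```shell
# Print each thermal zone's name and temperature in degrees Celsius.
# The optional argument overrides the sysfs base directory
# (default: /sys/class/thermal).
read_thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        name=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # Some zones can fail to read; skip them instead of aborting.
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        # sysfs reports millidegrees Celsius, so divide by 1000.
        awk -v n="$name" -v t="$milli" 'BEGIN { printf "%s %.1f\n", n, t / 1000 }'
    done
}
```

Running it in a loop next to the script, appending to a file and calling sync after each sample, should leave the last readings on disk even if the board loses power.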

The application is from an object tracking benchmark: STVIR/pysot on GitHub, the SenseTime Research platform for single object tracking, implementing algorithms like SiamRPN and SiamMask.

Basically the steps to reproduce the issue (aside from downloading the dataset and building the python extensions in the repo) are:

cd experiments/siamrpn_mobilev2_l234_dwxcorr
python3 ../../tools/test.py --snapshot model.pth --config config.yaml --dataset VOT2018

I will try to put together a step-by-step guide with a preconfigured Docker container to reproduce the issue.

Hi,

We set up a container with the precompiled repository.
Here are the steps to reproduce our issue:

Pull the Docker image:
docker pull allu1234/pysot-xavier-torch19:2.0

Download the VOT2018 dataset via this Google Drive link:
https://drive.google.com/file/d/1Nea1OVnkYoVQAPZ7t5RYWPIxMSmKDenN/view?usp=sharing

Run the container (set “pathtodataset” to the location of the VOT2018 folder):
docker run --rm -v pathtodataset:/home/pysot/testing_dataset/VOT2018 --runtime nvidia allu1234/pysot-xavier-torch19:2.0 python3 tools/test.py --snapshot experiments/siamrpn_mobilev2_l234_dwxcorr/model.pth --config experiments/siamrpn_mobilev2_l234_dwxcorr/config.yaml --dataset VOT2018

Hi,

I just want to confirm one thing first.

The same issue occurs outside of the container, and you only packaged it as an image so that we can reproduce it. Is that correct?

Thanks.

Hi,

Yes, that is correct.

Thanks.

We are checking this issue internally.
Will share more information with you later.

This reminds me of the issues I have seen: AGX Xavier freeze in MAXN mode

The patch provided by NVIDIA (@WayneWWW did it make it into the latest JetPack yet?) does not always work; the issue recurs time and again…

The patch was only merged in Nov 2021, so I think only JetPack 4.6.1 (rel-32.7.1) has it.
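
To check which L4T release a board is actually running, the first line of /etc/nv_tegra_release can be parsed. This is a small sketch with a hypothetical helper name, l4t_release. It assumes the usual header format of that file ("# R32 (release), REVISION: 6.1, ...") and prints nothing if the format differs.

```shell
# Print the L4T release as MAJOR.REVISION (for example 32.6.1) from the
# first line of /etc/nv_tegra_release, or from the file given as $1.
l4t_release() {
    sed -n '1s/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9.][0-9.]*\),.*/\1.\2/p' \
        "${1:-/etc/nv_tegra_release}"
}
```

As mentioned in this thread, 32.6.1 corresponds to JetPack 4.6 and 32.7.1 to JetPack 4.6.1.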

Hi,

We tested this on a Xavier 16GiB with JetPack 4.6.1.
The script ran normally in MAXN mode.

Would you mind also giving it a try?
Thanks.

Hi,
Thanks for the info. We are going to try it with JetPack 4.6.1 next week.

Hi,

I am happy to confirm the issue is fixed with JetPack 4.6.1 and the script runs fine now.
Thanks for your help!
