AGX Xavier freeze in MAXN mode

Running a CPU- and GPU-intensive application (training a model with Coqui-TTS), I see a reproducible freeze of my AGX Xavier a few minutes after setting it to nvpmodel 0 / MAXN mode. The fan profile is set to “cool”. The same application runs for hours and days in “30W all” mode without problems.

The last lines of dmesg --follow before the crash are:

[71675.256574] FAN rising trip_level:2 cur_temp:61350 trip_temps[3]:62000
[71679.736282] FAN rising trip_level:3 cur_temp:62100 trip_temps[4]:73000
[71684.217226] FAN rising trip_level:2 cur_temp:60600 trip_temps[3]:62000
[71685.340099] FAN rising trip_level:2 cur_temp:60900 trip_temps[3]:62000
[71687.579923] FAN rising trip_level:2 cur_temp:61350 trip_temps[3]:62000
[71692.059994] FAN rising trip_level:2 cur_temp:61700 trip_temps[3]:62000
[71695.415514] FAN rising trip_level:2 cur_temp:61200 trip_temps[3]:62000
[71699.895299] FAN rising trip_level:2 cur_temp:61000 trip_temps[3]:62000
[71704.375099] FAN rising trip_level:2 cur_temp:61400 trip_temps[3]:62000
[71706.618925] FAN rising trip_level:2 cur_temp:61600 trip_temps[3]:62000
[71711.094740] FAN rising trip_level:2 cur_temp:61650 trip_temps[3]:62000
[71715.574514] FAN rising trip_level:2 cur_temp:61950 trip_temps[3]:62000
[71720.054299] FAN rising trip_level:3 cur_temp:62250 trip_temps[4]:73000
[71724.538043] FAN rising trip_level:2 cur_temp:61250 trip_temps[3]:62000
[71727.893844] FAN rising trip_level:2 cur_temp:61350 trip_temps[3]:62000
[71729.013868] FAN rising trip_level:2 cur_temp:61550 trip_temps[3]:62000
[71732.373649] FAN rising trip_level:2 cur_temp:61950 trip_temps[3]:62000
[71736.853423] FAN rising trip_level:2 cur_temp:61700 trip_temps[3]:62000

The temperature rises slowly above the trip level (62°C), the fan speeds up, and shortly after that the freeze occurs: the fan stops spinning, the device does not respond, and I need to reset it by pressing the hardware button.
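
To check how hot the board actually got before the freeze, the fan lines can be parsed rather than read by eye. The following is a minimal sketch, assuming the line format is exactly the one shown in the excerpt above and that `cur_temp` and `trip_temps` are in millidegrees Celsius (the post equates 62000 with 62°C). The names `parse_fan_lines` and `peak_temperature` are made up for this illustration.

```python
import re

# Pattern for the fan trip lines as they appear in the dmesg excerpt.
# Assumption: temperatures are reported in millidegrees Celsius.
FAN_LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+FAN rising trip_level:(?P<level>\d+)\s+"
    r"cur_temp:(?P<temp>\d+)\s+trip_temps\[\d+\]:(?P<trip>\d+)"
)


def parse_fan_lines(text):
    """Return (timestamp_s, trip_level, temp_c, next_trip_c) for each fan line."""
    samples = []
    for line in text.splitlines():
        match = FAN_LINE.search(line)
        if match:
            samples.append((
                float(match["ts"]),
                int(match["level"]),
                int(match["temp"]) / 1000.0,
                int(match["trip"]) / 1000.0,
            ))
    return samples


def peak_temperature(samples):
    """Highest temperature in degrees Celsius, or None if nothing was parsed."""
    return max(sample[2] for sample in samples) if samples else None
```

Feeding the saved output of `dmesg` into `parse_fan_lines` and then `peak_temperature` gives the highest value the fan driver logged, which can be compared against the throttling and shutdown thresholds in the thermal specifications.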

This might be related to “AGX Xavier easy to crash when ethernet network connected” (#21 by simon.glet), but I don’t see any error messages in syslog/kern.log.

Any ideas or recommendations on how to investigate the issue further? Do I have a defective device?

hello dkreutz,

this sounds abnormal.
please refer to the developer guide, Thermal Specifications.
a hardware thermal shutdown is triggered to shut down the platform when all other cooling strategies have failed, and in particular, after a software shutdown has failed to occur when it should.
in addition, as you can see in the supported power states, 62°C does not even reach SW or HW throttling.

may I know the detailed steps you used to enable MaxN mode?
did you use the jetson-xavier-maxn.conf board flash configuration to flash the target?
thanks

Hi Jerry,

Thanks for looking into this.

I have used jtop (from jetson_stats) to set the power mode, as I assume it does the same as a manual sudo nvpmodel -m 0.
jetson-clocks is not enabled in my test scenario.

I am not sure what “board flash configuration” means. I received my AGX Xavier DevKit in January 2020 and initially flashed it with SDK-manager from a Linux host. Since then I have performed OTA updates of JetPack on the Xavier (currently JP 4.5.1). I don’t know if this is important: I have installed an NVMe SSD and boot from there (following the instructions from jetsonhacks).

hello dkreutz,

if you go through SDKManager, it uses jetson-xavier.conf by default to flash the board.
please try flashing the board with the MaxN configuration instead and then try to reproduce the issue.
for example, $ sudo ./flash.sh -r jetson-xavier-maxn mmcblk0p1
this uses the jetson-xavier-maxn.conf board flash configuration, which includes different configuration files, such as mb1-bct, bpmp-dtb, and dtb, and flashes them to the target.
thanks

Hello Jerry,

It took me some time to prepare the “Linux for Tegra” environment with the ./flash.sh script. Do I understand correctly that the proposed ./flash.sh -r ... only updates some board configuration? I want to avoid accidentally wiping my boot partition…

Eventually I flashed the Xavier AGX, but the problem with freezes (actually it is a power off/shutdown) persists.

The last lines of the flash command log were

l4t_sign_image.sh: Generate 16-byte-size-aligned base file for kernel_tegra194-p2888-0001-p2822-0000-maxn_sigheader.dtb.encrypt
l4t_sign_image.sh: the signed file is /home/dominik/Linux_for_Tegra/bootloader/temp_user_dir/kernel_tegra194-p2888-0001-p2822-0000-maxn_sigheader.dtb.encrypt
done.
Reusing existing system.img... 
file does not exist.

How can I verify that my flash command was executed successfully?

If this is your first time flashing, please do not use “-r” in your flash command.

Also, a successful flash will show a “flash successfully” log; if you didn’t see that, the flash process did not complete.

hello dkreutz,

you must have an existing system.img to use the -r option.
the -r option skips building system.img and reuses the existing one.
please check Flash Script Usage for more details.
thanks

Hello Jerry,

After a successful flash the issue still persists.
Any hints on how to investigate this further?

do you still see the Xavier freeze even though the temperature does not reach the trip point for SW throttling?

Yes, my AGX Xavier still shuts down (powers off) unexpectedly when running in MAXN mode. Sometimes it takes several minutes, sometimes it happens a few seconds after switching to nvpmodel “0”.

I don’t see any suspicious messages in the usual logs (dmesg, syslog, etc.). Where should I look for error messages, and can some extended/debug logging be activated?
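
Because the board powers off without warning, ordinary buffered logs may lose the last few seconds. One generic way to capture the state just before the shutdown is a small logger that samples the thermal zones and forces every line to disk. This is a minimal sketch, not an NVIDIA tool: it assumes the standard Linux sysfs layout (`/sys/class/thermal/thermal_zone*/type` and `temp`, with `temp` in millidegrees Celsius), and the function names and the `thermal.log` file name are hypothetical.

```python
import glob
import os
import time


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_type: temperature_in_celsius} for every readable zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as handle:
                name = handle.read().strip()
            with open(os.path.join(zone, "temp")) as handle:
                # Assumption: sysfs reports millidegrees Celsius.
                readings[name] = int(handle.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return readings


def append_synced(path, line):
    """Append one line and fsync so it survives an abrupt power-off."""
    with open(path, "a") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def log_forever(path="thermal.log", interval_s=1.0):
    """Sample all zones once per interval until the process is stopped."""
    while True:
        zones = read_thermal_zones()
        fields = " ".join(f"{name}={temp:.1f}" for name, temp in sorted(zones.items()))
        append_synced(path, f"{time.time():.1f} {fields}")
        time.sleep(interval_s)
```

Running `log_forever()` in a second terminal before switching to MAXN leaves a file whose last line shows the temperatures within roughly one interval of the power-off. The `fsync` per sample costs some I/O but is the point of the exercise.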

Note: the last time I successfully ran the same task/application without any issues was on JetPack 4.4. Now I am on JP 4.5.1.

hello dkreutz,

did you test it before on the same Jetson AGX Xavier platform?
to rule out a hardware issue, would you please roll back to JetPack-4.4 and configure it as MaxN to verify?
thanks

OK, I will try that. It will probably take until next weekend to perform the rollback; I will report back when it is done.

Hi @JerryChang same problem here !

Hi @dkreutz which power supply do you use ?

Any other suggestions? Downgrading is possible for me.

Best Regards
Martin

I am using the power supply that was shipped with my AGX Xavier DevKit (LiteOn PA-1650-90, 19V, 65W).

I did a fresh install of JetPack 4.4. Unfortunately I still see the same behaviour: the Xavier AGX shuts down unexpectedly and without any warning/error as soon as its power mode/nvpmodel is switched to MAXN/0 while running a CPU- and GPU-hungry application.

I also encountered an auto-reboot issue in maxn/fan:cool mode.

Did you check bluetooth interrupt counter in your device?

$ grep blue /proc/interrupts

It increased faster and faster until the CPU or network stalled on my devkit.

This returns

 392:        165          0          0          0          0          0          0          0  tegra-gpio  192 Edge      bluetooth hostwake

but what do these numbers tell me?
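
In the general Linux `/proc/interrupts` layout, the first column is the IRQ number, followed by one counter per CPU (how often that CPU has handled the interrupt since boot), then the interrupt controller, the hardware IRQ number, the trigger type and the name the driver registered. A minimal sketch of reading the line above that way (`parse_interrupt_line` is a hypothetical helper, and it assumes every leading integer after the colon is a per-CPU counter):

```python
def parse_interrupt_line(line):
    """Split one /proc/interrupts line into IRQ, per-CPU counts and description."""
    irq, _, rest = line.partition(":")
    tokens = rest.split()
    counts = []
    # Assumption: leading integer tokens are the per-CPU counters.
    while tokens and tokens[0].isdigit():
        counts.append(int(tokens.pop(0)))
    return {
        "irq": irq.strip(),
        "per_cpu": counts,
        "total": sum(counts),
        "description": " ".join(tokens),
    }
```

Applied to the line above, this gives IRQ 392 with 165 interrupts handled on the first CPU, none on the other seven, and the description `tegra-gpio 192 Edge bluetooth hostwake`. A single reading says little on its own; what matters is how quickly the total grows.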
And why is there Bluetooth at all? To my knowledge the Xavier AGX DevKit has no Bluetooth on board.

Does this counter increase very fast while running your intensive application?
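
One way to answer this with a number is to read the counter twice and divide the difference by the elapsed time. This is a minimal sketch, assuming the interrupt is still registered under the name `bluetooth hostwake` as in the output above. The helper names and the 5-second default interval are arbitrary choices for illustration.

```python
import time


def total_for(text, needle="bluetooth hostwake"):
    """Sum the per-CPU counters of every /proc/interrupts line containing needle."""
    total = 0
    for line in text.splitlines():
        if needle not in line:
            continue
        _, _, rest = line.partition(":")
        for token in rest.split():
            if not token.isdigit():
                break  # counters end at the first non-numeric column
            total += int(token)
    return total


def interrupts_per_second(before, after, elapsed_s):
    """Average interrupt rate between two counter readings."""
    return (after - before) / elapsed_s


def sample_rate(needle="bluetooth hostwake", interval_s=5.0, path="/proc/interrupts"):
    """Read the counter twice, interval_s apart, and return the rate."""
    with open(path) as handle:
        before = total_for(handle.read(), needle)
    time.sleep(interval_s)
    with open(path) as handle:
        after = total_for(handle.read(), needle)
    return interrupts_per_second(before, after, interval_s)
```

Calling `sample_rate()` once while idle and again while the training job runs in MAXN mode shows whether the rate climbs under load.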

Please check whether these patches help with the bluetooth hostwake interrupt and also with the reboot/shutdown issue.

Patch 1.

diff --git a/drivers/misc/bluedroid_pm.c b/drivers/misc/bluedroid_pm.c
index 3708edd..9bfe1d7 100644
--- a/drivers/misc/bluedroid_pm.c
+++ b/drivers/misc/bluedroid_pm.c
@@ -332,7 +332,7 @@
 		BDP_DBG("found host_wake irq\n");
 		ret = request_irq(bluedroid_pm->host_wake_irq,
 					bluedroid_pm_hostwake_isr,
-					IRQF_TRIGGER_RISING,
+					IRQF_TRIGGER_NONE,
 					"bluetooth hostwake", bluedroid_pm);
 		if (ret) {
 			BDP_ERR("Failed to get host_wake irq\n");

Patch 2.

diff --git a/common/tegra194-p2888-0001-p2822-0000-common.dtsi b/common/tegra194-p2888-0001-p2822-0000-common.dtsi
index 3453ce4..079f785 100644
--- a/common/tegra194-p2888-0001-p2822-0000-common.dtsi
+++ b/common/tegra194-p2888-0001-p2822-0000-common.dtsi
@@ -73,7 +73,7 @@
 		bluedroid_pm,host-wake-gpio = <&tegra_main_gpio TEGRA194_MAIN_GPIO(Y, 0) 0>;
 		bluedroid_pm,ext-wake-gpio = <&tegra_main_gpio TEGRA194_MAIN_GPIO(M, 7) 0>;
 		interrupt-parent = <&tegra_main_gpio>;
-		interrupts = <TEGRA194_MAIN_GPIO(Y, 0) 0x01>;
+		interrupts = <TEGRA194_MAIN_GPIO(Y, 0) IRQ_TYPE_LEVEL_LOW>;
 	};
 
 	spi@c260000 {