AGX Xavier easy to crash when ethernet network connected

ynjiun · September 7, 2020, 12:04am

I can run multiple instances of deepstream_test_3.py (up to 7 pipelines, each fed 4 video files) without crashing the system (or causing it to reboot itself) if I disconnect the Ethernet network on the AGX Xavier (by clicking the network icon in the top-right corner and selecting Disconnect right below Wired connection 1).

However, if I reconnect the network and run the same set of deepstream_test_3.py instances, the system crashes (reboots itself).

Steps to reproduce the crash (self-reboot):

1 - Turn on the AGX and make sure the network is connected.
2 - Set MAXN mode and set the fan to 255.
3 - Run 5 to 7 copies of deepstream_test_3.py, each fed 4 videos (the more copies you run, the easier it is to reproduce the problem).
4 - On a PC (running Ubuntu 18.04), run “ssh agx.local” to connect to the AGX, then run tegrastats in the background to log the status every second, and use “tail tegralog” to view the log frequently.
5 - Around the time the 5th or 6th copy of deepstream_test_3.py is running, the system crashes (then reboots itself).
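
The "run several copies" step can be scripted so that every attempt starts the same number of pipelines. This is only a sketch: it assumes deepstream_test_3.py takes its stream URIs as positional arguments and is run from the directory that contains it, and the function names are made up for illustration.

```python
import subprocess
import sys


def build_commands(n_copies, uris, script="deepstream_test_3.py"):
    """Return one argv list per pipeline copy; every copy gets the same URIs."""
    # Assumption: the script accepts the stream URIs as positional arguments.
    return [[sys.executable, script, *uris] for _ in range(n_copies)]


def launch(commands):
    """Start every command without waiting, so the copies run concurrently."""
    return [subprocess.Popen(cmd) for cmd in commands]
```

For example, `launch(build_commands(6, uris))` with a list of four `file://` URIs would start six copies at once; calling `wait()` on each returned process blocks until that copy exits.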

Background: it has been a long road to get here. Initially I suspected voltage swings in my power supply, so I added a 600W line conditioner to rule out power issues. Then I suspected a thermal issue, but according to tegrastats the GPU temperature never exceeded 47C, and the CPU and other thermal readings were lower than that. Eventually linuxdev pointed out, from one of my self-rebooting logs, that the network was actually causing the self-reboot, and that led to this post showing how to reproduce the issue. Attached please find the serial console log 7_run4_network_on_crash.log (233.4 KB) and the tegrastats log 7_run4_network_on_crash_tegrastats.log (80.6 KB) from when the system crashed. Be aware that the network error may not always show up in the console log. But so far, whenever the network is on, the system is not stable. I have tried two different routers and the result is the same: with the network on, the system crashes very easily when running multiple pipelines; with the network off, the system has been very solid so far.

Question: my product needs Ethernet turned on to transmit results in real time, but whenever the network is on, the system is not stable (it keeps rebooting itself). How can we overcome this issue? WiFi is not an option for our product. Please help. Thanks a lot in advance.

alanz · September 7, 2020, 5:51am

Could you share the JP version you are working on?

You can get it with $ head -1 /etc/nv_tegra_release

ynjiun · September 7, 2020, 3:58pm

Could you share the JP version you are working on?
You can get it with $ head -1 /etc/nv_tegra_release

# R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020
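
The release line above reports the L4T version rather than the JetPack version, which is easy to misread (REVISION 4.3 is not JetPack 4.3). A minimal sketch of turning that line into an L4T version and looking up the matching JetPack release follows. The lookup table is an assumption limited to a few pairs and should be checked against NVIDIA's release notes; the function name is made up.

```python
import re

# Assumed L4T -> JetPack pairs; deliberately incomplete, verify before relying on it.
L4T_TO_JETPACK = {"32.3.1": "4.3", "32.4.3": "4.4", "32.4.4": "4.4.1"}


def parse_l4t(line):
    """Extract the L4T version (e.g. '32.4.3') from the first line of /etc/nv_tegra_release."""
    m = re.search(r"R(\d+) \(release\), REVISION: ([\d.]+)", line)
    if m is None:
        raise ValueError("unrecognized nv_tegra_release line")
    return f"{m.group(1)}.{m.group(2)}"


line = ("# R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, "
        "EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020")
l4t = parse_l4t(line)
print(l4t, "-> JetPack", L4T_TO_JETPACK.get(l4t, "unknown"))
```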

ynjiun · September 7, 2020, 6:51pm

This morning, after running a few apps, everything seemed normal. But I left the unit on with the network connected, and a few hours later, even without running anything, the unit rebooted itself…

Attached please find the serial console log of this event: self_reboot_not_running_anything.log (295.1 KB)

ynjiun · September 7, 2020, 9:53pm

Hi alanz, I found there is an Ethernet kernel patch here. Shall we patch the kernel, or does JP 4.3 already include the fix? Please advise. Thanks.

ynjiun · September 7, 2020, 11:21pm

Another serial console log for a self-reboot; this time no network was connected and it still rebooted itself: self_reboot_network_off_running_nothing.log (98.9 KB)

alanz · September 8, 2020, 1:27am

Please try with JetPack 4.4. In the meantime, I’ll try to see if I can reproduce the issue.

ynjiun · September 8, 2020, 2:20am

My AGX is already flashed and installed with JetPack 4.4.

alanz · September 8, 2020, 3:00am

Yes, you are right.

I’m trying to reproduce it and will get back to you later.

alanz · September 8, 2020, 8:56am

After a 6-hour test with the network on, I haven’t hit the kernel panic.
I tried with JP 4.4 on a Xavier devkit.

simon.glet · September 8, 2020, 1:52pm

Hi @alanz

We are not exactly sure what is causing the reboots but it is happening. You might want to have a look at Jetson AGX Xavier self rebooting - #46 by simon.glet.

Thanks
Simon

ynjiun · September 8, 2020, 4:14pm

Hi alanz, I am curious about your reproduction environment:

Do you connect the AGX to a monitor (display), or is it headless? If it’s headless, what do you use to connect to the unit: ssh or VNC?
What’s the power mode? MAXN or other?
Did you run “sudo jetson_clocks” before your testing or not?
Did you run any apps during these 6 hours?
What’s your JP version (“head -1 /etc/nv_tegra_release”)?
What was the GPU temperature during the run?
Did you ever encounter “INFO: rcu_sched detected stalls on CPUs/tasks: 0” during the 6 hours?

Thanks for this information. It will help identify the differences between your system and ours.

Attached is another self-reboot console log from last night (it happened constantly): multiple_self_reboot.log (529.7 KB)
When the self-rebooting happens constantly, I notice a few things:

GPU/CPU/thermal temperature > 35C (even with no apps running) at 28C room temperature.
CPU 1 load > 98%, almost always at 100%; I don’t know what is running, since the unit is not running any apps.
The unit goes into a mode where it reboots itself every few minutes, and I have to shut it down by pulling the plug and leave it overnight (I cannot work on this unit anymore…)

This morning, when I turned on the unit, all CPU/GPU/thermal readings were < 32C, CPU 1 load was < 10%, and everything seems stable and normal.

What does this imply? I have suspected for a long time that this unit is thermally sensitive, but I have never been able to reproduce or nail it down reliably. When it happens (self-rebooting), it happens repeatedly, and I need to wait until the next day for it to clear up. Very strange behaviour. (Basically, the unit is not usable anymore… ;( )

simon.glet · September 8, 2020, 4:51pm

Hi @ynjiun

I think you are on to something with the temperature.

The default fan setting is quiet, which has a trip temperature of 46C. I changed the setting to cool, which has a trip temperature of 35C, with:
sudo nvpmodel -d cool

Since then, the devkit has been playing YouTube HD full-screen videos non-stop with no issue.

Here is the latest tegrastats:
RAM 2440/31925MB (lfb 6939x4MB) SWAP 0/15963MB (cached 0MB) CPU [31%@2265,27%@2265,22%@2265,24%@2265,31%@2265,38%@2265,36%@2265,43%@2265] EMC_FREQ 0% GR3D_FREQ 28% AO@34C GPU@34.5C Tdiode@36.5C PMIC@100C AUX@34C CPU@36C thermal@34.95C Tboard@34C GPU 619/670 CPU 4183/3586 SOC 2788/2544 CV 154/154 VDDRQ 929/897 SYS5V 2564/2474
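
A tegrastats line like the one above is easier to compare across runs once the per-core loads and temperatures are pulled out. The sketch below parses only the fields visible in this thread; the function name is made up, and the patterns assume the `load%@freq` and `NAME@tempC` layout shown here, which may differ on other L4T releases.

```python
import re


def parse_tegrastats(line):
    """Return per-core CPU load (%) and temperatures (deg C) from one tegrastats line."""
    cpu = re.search(r"CPU \[([^\]]*)\]", line)
    loads = [int(x) for x in re.findall(r"(\d+)%@\d+", cpu.group(1))] if cpu else []
    # Temperature fields look like GPU@34.5C; the CPU load entries use '%@' and do not match.
    temps = {name: float(val)
             for name, val in re.findall(r"(\w+)@(-?\d+(?:\.\d+)?)C\b", line)}
    return {"cpu_load": loads, "temps": temps}


sample = (
    "RAM 2440/31925MB (lfb 6939x4MB) SWAP 0/15963MB (cached 0MB) "
    "CPU [31%@2265,27%@2265,22%@2265,24%@2265,31%@2265,38%@2265,36%@2265,43%@2265] "
    "EMC_FREQ 0% GR3D_FREQ 28% AO@34C GPU@34.5C Tdiode@36.5C PMIC@100C AUX@34C "
    "CPU@36C thermal@34.95C Tboard@34C GPU 619/670 CPU 4183/3586 SOC 2788/2544 "
    "CV 154/154 VDDRQ 929/897 SYS5V 2564/2474"
)
stats = parse_tegrastats(sample)
print(stats["cpu_load"], stats["temps"]["GPU"])
```

Running the same function over every line of a tegrastats log gives a time series of core loads and temperatures that can be lined up against the reboot times.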

Cheers
Simon

linuxdev · September 8, 2020, 7:45pm

It looks like this is probably fixed here:
https://forums.developer.nvidia.com/t/xavier-with-jp4-2-hangs/72014/8

WayneWWW · September 9, 2020, 2:43am

Hi ynjiun,

For your case, could you give us a summary of how many issues you’ve filed?

It looks like all of them are connected rather than separate issues…

For example, I saw you have the topic below too, plus the previous “power supply” issue I saw. You’ve filed 3 topics and all of them look the same to me.

As I pointed out in the power supply topic, you always see a kernel panic before the system reboots. And that kernel panic is from the Ethernet driver. That is also connected to this topic.

Thus, please stop filing new topics. We can use this one to track.

ynjiun · September 9, 2020, 5:13pm

Hi Wayne,

The summary is in this post

It links to all the posts I have filed on this issue. It seems all the issues I have filed so far are linked to one symptom (not the root cause): the CPU 1 load inches up over time to 100%, or near 100%, and then the system crashes.

It could be (my guess) that some part of the system keeps firing IRQs and inundates the CPU (that is, the load gets higher and higher over time). The suspected parts (could be s/w or h/w) are:

power management: bpmp, etc.
network: eqos, etc.
gpu: nvgpu, etc.
others
eventually causing the CPU to stall, followed by “kernel panic - not syncing: softlockup”

Well, that’s my two cents of guessing, but I have no clue what is causing these symptoms. My setup is extremely simple (display + keyboard + mouse + Ethernet), with no other sensors. The unit uses the 65W power supply that came with the product, plugged into a 600W line conditioner used exclusively for the AGX Xavier (no other device plugged in). The power mode is set to MAXN, with “sudo nvpmodel -d cool” to keep the fan running. The system can still reboot itself without any apps running. Yesterday, for example, I turned it on around 9:00am; it rebooted itself around 12:15pm, a second time around 12:45pm (still nothing running), a third time around 1:15pm (still no apps running), and a fourth time around 4:30pm. All the console logs and tegrastats logs can be found in this post

Thank you for following up.

simon.glet · September 9, 2020, 10:05pm

Hey @ynjiun

Now that you have the fan mode set to cool, if you run “dmesg --follow” or “tail -f /var/log/syslog”, do you see something like this:

“[ 3785.834613] FAN rising trip_level:1 cur_temp:35000 trip_temps[2]:53000
[ 5261.947583] FAN cooling trip_level:0 cur_temp:25800 trip_temps[1]:35000
[ 7076.304136] FAN rising trip_level:1 cur_temp:35000 trip_temps[2]:53000
[31135.117898] FAN cooling trip_level:0 cur_temp:25800 trip_temps[1]:35000”
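
The FAN trip messages quoted above can be extracted from a saved dmesg or syslog capture to see how often the fan changes level. This is a sketch that matches only the message format shown in this post; it assumes cur_temp and trip_temps are in millidegrees Celsius (35000 = 35C, consistent with the 35C trip temperature mentioned earlier), and the names are made up.

```python
import re

FAN_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+FAN (?P<dir>rising|cooling) "
    r"trip_level:(?P<level>\d+) cur_temp:(?P<cur>\d+) "
    r"trip_temps\[(?P<idx>\d+)\]:(?P<trip>\d+)"
)


def parse_fan_events(text):
    """Return one dict per FAN trip message found in the log text."""
    events = []
    for m in FAN_RE.finditer(text):
        events.append({
            "time_s": float(m["ts"]),             # seconds since boot
            "direction": m["dir"],                # 'rising' or 'cooling'
            "trip_level": int(m["level"]),
            "cur_temp_c": int(m["cur"]) / 1000.0,  # assumed millidegrees C
            "trip_temp_c": int(m["trip"]) / 1000.0,
        })
    return events


sample = """
[ 3785.834613] FAN rising trip_level:1 cur_temp:35000 trip_temps[2]:53000
[ 5261.947583] FAN cooling trip_level:0 cur_temp:25800 trip_temps[1]:35000
[ 7076.304136] FAN rising trip_level:1 cur_temp:35000 trip_temps[2]:53000
[31135.117898] FAN cooling trip_level:0 cur_temp:25800 trip_temps[1]:35000
"""
for event in parse_fan_events(sample):
    print(event["time_s"], event["direction"], event["cur_temp_c"])
```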

Cheers
Simon

WayneWWW · September 10, 2020, 2:42am

Hi ynjiun,

Could you check /proc/interrupts and see whether any interrupt looks abnormal?

Actually, I think it might be good to RMA this device. Do you see the eqos issue if you just run “stress” to push up the CPU load?
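
One way to look for an abnormal interrupt is to take two snapshots of /proc/interrupts some seconds apart and rank the IRQs by how much their counts grew. The sketch below assumes the usual layout (a header row of CPU names, then a label, one count per CPU, and a description). The sample snapshots are made-up numbers for illustration only, and the function names are hypothetical.

```python
def parse_interrupts(text):
    """Map 'label description' -> total count across CPUs for a /proc/interrupts dump."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:n_cpus]:         # rows like 'Err:' carry fewer counts
            if not field.isdigit():
                break
            counts.append(int(field))
        name = " ".join(fields[len(counts):])
        totals[f"{label.strip()} {name}".strip()] = sum(counts)
    return totals


def busiest(before, after, top=5):
    """Return the IRQs whose counts grew most between two snapshots."""
    delta = {key: after[key] - before.get(key, 0) for key in after}
    return sorted(delta.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Made-up snapshots; on the device, read open("/proc/interrupts").read() twice instead.
before = parse_interrupts("""
           CPU0       CPU1
 43:        100          0     GICv2 222 Level     2490000.ether_qos.rx0
 61:        500          0     GICv2  58 Level     c240000.i2c
""")
after = parse_interrupts("""
           CPU0       CPU1
 43:       5100          0     GICv2 222 Level     2490000.ether_qos.rx0
 61:        600         40     GICv2  58 Level     c240000.i2c
""")
print(busiest(before, after, top=1))
```

Dividing each delta by the time between the two snapshots gives an interrupts-per-second rate, which is usually the number worth comparing between a healthy and a misbehaving unit.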

simon.glet · September 10, 2020, 4:40pm

Hi @WayneWWW,

As I have the same board/software version and issues as @ynjiun, here are the results of /proc/interrupts:

           CPU0      CPU1      CPU2      CPU3      CPU4      CPU5      CPU6      CPU7
   2:         0         0         0         0         0         0         0         0  GICv2 29 Level trusty
   3:   3139135    976517    408455    378605   1450401   1813446   1221013   2805618  GICv2 30 Level arch_timer
   6:   1521184         0         0         0         0         0         0         0  GICv2 208 Level hsp
   7:         0         0         0         0         0         0         0         0  GICv2 202 Level arm-smmu global fault
   8:         0         0         0         0         0         0         0         0  GICv2 203 Level arm-smmu global fault
   9:         0         0         0         0         0         0         0         0  GICv2 264 Level arm-smmu global fault
  10:         0         0         0         0         0         0         0         0  GICv2 265 Level arm-smmu global fault
  11:         0         0         0         0         0         0         0         0  GICv2 272 Level arm-smmu global fault
  12:         0         0         0         0         0         0         0         0  GICv2 273 Level arm-smmu global fault
  13:         0         0         0         0         0         0         0         0  GICv2 368 Level tegra-p2u-intr
  14:         0         0         0         0         0         0         0         0  GICv2 369 Level tegra-p2u-intr
  15:         0         0         0         0         0         0         0         0  GICv2 370 Level tegra-p2u-intr
  16:         0         0         0         0         0         0         0         0  GICv2 371 Level tegra-p2u-intr
  17:         0         0         0         0         0         0         0         0  GICv2 372 Level tegra-p2u-intr
  18:         0         0         0         0         0         0         0         0  GICv2 373 Level tegra-p2u-intr
  19:         0         0         0         0         0         0         0         0  GICv2 374 Level tegra-p2u-intr
  20:         0         0         0         0         0         0         0         0  GICv2 375 Level tegra-p2u-intr
  21:         0         0         0         0         0         0         0         0  GICv2 376 Level tegra-p2u-intr
  22:         0         0         0         0         0         0         0         0  GICv2 377 Level tegra-p2u-intr
  23:         0         0         0         0         0         0         0         0  GICv2 253 Level tegra-p2u-intr
  24:         0         0         0         0         0         0         0         0  GICv2 254 Level tegra-p2u-intr
  25:         0         0         0         0         0         0         0         0  GICv2 378 Level tegra-p2u-intr
  26:         0         0         0         0         0         0         0         0  GICv2 379 Level tegra-p2u-intr
  27:         0         0         0         0         0         0         0         0  GICv2 380 Level tegra-p2u-intr
  28:         0         0         0         0         0         0         0         0  GICv2 381 Level tegra-p2u-intr
  29:         0         0         0         0         0         0         0         0  GICv2 382 Level tegra-p2u-intr
  30:         0         0         0         0         0         0         0         0  GICv2 383 Level tegra-p2u-intr
  31:         0         0         0         0         0         0         0         0  GICv2 235 Level tegra-p2u-intr
  32:         0         0         0         0         0         0         0         0  GICv2 252 Level tegra-p2u-intr
  33:         0         0         0         0         0         0         0         0  GICv2 104 Level tegra-pcie-intr
  34:         0         0         0         0         0         0         0         0  GICv2 105 Level tegra-pcie-msi
  35:         1         0         0         0         0         0         0         0  GICv2 77 Level tegra-pcie-intr, PCIe PME, aerdrv
  36:         0         0         0         0         0         0         0         0  GICv2 78 Level tegra-pcie-msi
  37:         0         0         0         0         0         0         0         0  GICv2 81 Level tegra-pcie-intr
  38:         0         0         0         0         0         0         0         0  GICv2 82 Level tegra-pcie-msi
  39:         0         0         0         0         0         0         0         0  GICv2 85 Level tegra-pcie-intr
  40:         0         0         0         0         0         0         0         0  GICv2 86 Level tegra-pcie-msi
  41:         1         0         0         0         0         0         0         0  GICv2 226 Level ether_qos.common_irq
  43:    224191         0         0         0         0         0         0         0  GICv2 222 Level 2490000.ether_qos.rx0
  44:    206187         0         0         0         0         0         0         0  GICv2 218 Level 2490000.ether_qos.tx0
  51:        28         0         0         0         0         0         0         0  GICv2 144 Level 3100000.serial
  54:         0         0         0         0         0         0         0         0  GICv2 152 Level combined_uart rx
  55:     20123         0         0         0         0         0         0         0  GICv2 97 Level mmc0
  56:         0         0         0         0         0         0         0         0  GICv2 94 Level mmc1
  57:         0         0         0         0         0         0         0         0  GICv2 76 Level ufshcd
  58:         0         0         0         0         0         0         0         0  GICv2 68 Level 3210000.spi
  59:         0         0         0         0         0         0         0         0  GICv2 69 Level c260000.spi
  60:         0         0         0         0         0         0         0         0  GICv2 57 Level 3160000.i2c
  61:   2186762         0         0         0         0         0         0         0  GICv2 58 Level c240000.i2c
  62:         0         0         0         0         0         0         0         0  GICv2 59 Level 3180000.i2c
  63:         0         0         0         0         0         0         0         0  GICv2 60 Level 3190000.i2c
  64:         0         0         0         0         0         0         0         0  GICv2 62 Level 31b0000.i2c
  65:         3         0         0         0         0         0         0         0  GICv2 63 Level 31c0000.i2c
  66:       529         0         0         0         0         0         0         0  GICv2 64 Level c250000.i2c
  67:         0         0         0         0         0         0         0         0  GICv2 65 Level 31e0000.i2c
  70:       357         0         0         0         0         0         0         0  GICv2 193 Level snd_hda_tegra
  71:         0         0         0         0         0         0         0         0  GICv2 51 Level bc00000.rtcpu
  72:       115         0         0         0         0         0         0         0  GICv2 242 Level d230000.actmon
  73:     14833         0         0         0         0         0         0         0  GICv2 297 Level host_syncpt
  74:         2         0         0         0         0         0         0         0  GICv2 295 Level host_status
  75:         0         0         0         0         0         0         0         0  GICv2 238 Level vic
  76:         0         0         0         0         0         0         0         0  GICv2 268 Level nvdla0
  77:         0         0         0         0         0         0         0         0  GICv2 269 Level nvdla1
  78:     19949         0         0         0         0         0         0         0  GICv2 185 Level 15200000.nvdisplay
  79:         0         0         0         0         0         0         0         0  GICv2 186 Level 15210000.nvdisplay
  80:         0         0         0         0         0         0         0         0  GICv2 187 Level 15220000.nvdisplay
  81:         0         0         0         0         0         0         0         0  GICv2 191 Level tegra_dp
  82:         0         0         0         0         0         0         0         0  GICv2 192 Level tegra_dp
  85:         4         0         0         0         0         0         0         0  GICv2 194 Level cec_irq
  87:         0         0         0         0         0         0         0         0  GICv2 266 Level pva-isr
  88:         0         0         0         0         0         0         0         0  GICv2 267 Level pva-isr
  97:         0         0         0         0         0         0         0         0  GICv2 397 Level carmel-pmu
  98:         0         0         0         0         0         0         0         0  GICv2 270 Level noc_nonsecure_irq
  99:         0         0         0         0         0         0         0         0  GICv2 271 Level noc_secure_irq
 100:         0         0         0         0         0         0         0         0  PM 42 Level tegra_rtc
 101:         0         0         0         0         0         0         0         0  GICv2 255 Level mc_status
 103:         4         0         0         0         0         0         0         0  GICv2 165 Level c150000.tegra-hsp
 114:        31         0         0         0         0         0         0         0  GICv2 214 Level b950000.tegra-hsp, b950000.tegra-hsp, b950000.tegra-hsp
 118:         0         0         0         0         0         0         0         0  GICv2 315 Level 3ad0000.se_elp
 120:         0         0         0         0         0         0         0         0  GICv2 108 Level gpcdma.0
 121:         0         0         0         0         0         0         0         0  GICv2 109 Level gpcdma.1
 122:         0         0         0         0         0         0         0         0  GICv2 110 Level gpcdma.2
 123:         0         0         0         0         0         0         0         0  GICv2 111 Level gpcdma.3
 124:         0         0         0         0         0         0         0         0  GICv2 112 Level gpcdma.4
 125:         0         0         0         0         0         0         0         0  GICv2 113 Level gpcdma.5
 126:         3         0         0         0         0         0         0         0  GICv2 114 Level gpcdma.6
 127:         3         0         0         0         0         0         0         0  GICv2 115 Level gpcdma.7
 128:         0         0         0         0         0         0         0         0  GICv2 116 Level gpcdma.8
 129:         0         0         0         0         0         0         0         0  GICv2 117 Level gpcdma.9
 130:         0         0         0         0         0         0         0         0  GICv2 118 Level gpcdma.10
 131:         0         0         0         0         0         0         0         0  GICv2 119 Level gpcdma.11
 132:         0         0         0         0         0         0         0         0  GICv2 120 Level gpcdma.12
 133:         0         0         0         0         0         0         0         0  GICv2 121 Level gpcdma.13
 134:         0         0         0         0         0         0         0         0  GICv2 122 Level gpcdma.14
 135:         0         0         0         0         0         0         0         0  GICv2 123 Level gpcdma.15
 136:         0         0         0         0         0         0         0         0  GICv2 124 Level gpcdma.16
 137:         0         0         0         0         0         0         0         0  GICv2 125 Level gpcdma.17
 138:         0         0         0         0         0         0         0         0  GICv2 126 Level gpcdma.18
 139:         0         0         0         0         0         0         0         0  GICv2 127 Level gpcdma.19
 140:         0         0         0         0         0         0         0         0  GICv2 128 Level gpcdma.20
 141:         0         0         0         0         0         0         0         0  GICv2 129 Level gpcdma.21
 142:         0         0         0         0         0         0         0         0  GICv2 130 Level gpcdma.22
 143:         0         0         0         0         0         0         0         0  GICv2 131 Level gpcdma.23
 144:         0         0         0         0         0         0         0         0  GICv2 132 Level gpcdma.24
 145:         0         0         0         0         0         0         0         0  GICv2 133 Level gpcdma.25
 146:         0         0         0         0         0         0         0         0  GICv2 134 Level gpcdma.26
 147:         0         0         0         0         0         0         0         0  GICv2 135 Level gpcdma.27
 148:         0         0         0         0         0         0         0         0  GICv2 136 Level gpcdma.28
 149:         0         0         0         0         0         0         0         0  GICv2 137 Level gpcdma.29
 150:         0         0         0         0         0         0         0         0  GICv2 138 Level gpcdma.30
 248:         0         0         0         0         0         0         0         0  tegra-gpio 48 Edge force-recovery
 252:         2         0         0         0         0         0         0         0  tegra-gpio 52 Level phy_interrupt
 255:         0         0         0         0         0         0         0         0  tegra-gpio 55 Edge 3400000.sdhci cd
 258:         0         0         0         0         0         0         0         0  tegra-gpio 58 Level tmp451
 298:         1         0         0         0         0         0         0         0  tegra-gpio 98 Edge 15200000.nvdisplay
 349:         2         0         0         0         0         0         0         0  tegra-gpio 149 Edge rt5659
 392:    857416         0         0         0         0         0         0         0  tegra-gpio 192 Edge bluetooth hostwake
 438:        17         0         0         0         0         0         0         0  tegra-gpio-aon 10 Level ccg_irq
 460:         0         0         0         0         0         0         0         0  tegra-gpio-aon 32 Edge ufs_cd_gpio
 464:         0         0         0         0         0         0         0         0  tegra-gpio-aon 36 Edge power-key
 468:      3321         0         0         0         0         0         0         0  GICv2 39 Level 30c0000.watchdog
 472:         0         0         0         0         0         0         0         0  GICv2 198 Level 3550000.xudc
 473:        48         0         0         0         0         0         0         0  PM 195 Level xhci-hcd:usb1
 474:         1         0         0         0         0         0         0         0  PM 196 Level 3610000.xhci
 475:         0         0         0         0         0         0         0         0  PM 199 Level 3610000.xhci
 476:     10999         0         0         0         0         0         0         0  GICv2 102 Level gk20a_stall
 477:         0         0         0         0         0         0         0         0  GICv2 103 Level gk20a_nonstall
 478:         0         0         0         0         0         0         0         0  GICv2 424 Level ras-fhi
 479:         0         0         0         0         0         0         0         0  GICv2 425 Level ras-fhi
 480:         0         0         0         0         0         0         0         0  GICv2 426 Level ras-fhi
 481:         0         0         0         0         0         0         0         0  GICv2 427 Level ras-fhi
 482:         0         0         0         0         0         0         0         0  GICv2 428 Level ras-fhi
 483:         0         0         0         0         0         0         0         0  GICv2 429 Level ras-fhi
 484:         0         0         0         0         0         0         0         0  GICv2 430 Level ras-fhi
 485:         0         0         0         0         0         0         0         0  GICv2 431 Level ras-fhi
 486:         0         0         0         0         0         0         0         0  GICv2 262 Level noc_nonsecure_irq
 487:         0         0         0         0         0         0         0         0  GICv2 263 Level noc_secure_irq
 488:         0         0         0         0         0         0         0         0  GICv2 292 Level noc_nonsecure_irq
 489:         0         0         0         0         0         0         0         0  GICv2 204 Level noc_secure_irq
 490:         0         0         0         0         0         0         0         0  GICv2 294 Level noc_nonsecure_irq
 491:         0         0         0         0         0         0         0         0  GICv2 206 Level noc_secure_irq
 492:         0         0         0         0         0         0         0         0  GICv2 291 Level noc_nonsecure_irq
 493:         0         0         0         0         0         0         0         0  GICv2 207 Level noc_secure_irq
 494:         0         0         0         0         0         0         0         0  GICv2 293 Level noc_nonsecure_irq
 495:         0         0         0         0         0         0         0         0  GICv2 205 Level noc_secure_irq
 497:         0         0         0         0         0         0         0         0  PM 241 Edge max77620-top
 501:         0         0         0         0         0         0         0         0  max77620-top 3 Edge max77620-gpio
 502:         0         0         0         0         0         0         0         0  max77620-top 4 Edge max77686-rtc
 506:         0         0         0         0         0         0         0         0  max77620-top 8 Edge max77620-thermal
 507:         0         0         0         0         0         0         0         0  max77620-top 9 Edge max77620-thermal
 530:        56         0         0         0         0         0         0         0  agic-controller 32 Level
 531:        54         0         0         0         0         0         0         0  agic-controller 33 Level
 562:         0         0         0         0         0         0         0         0  max77686-rtc 1 Edge rtc-alarm1
 564:         0         0         0         0         0         0         0         0  PCI-MSI 0 Edge ahci[0001:01:00.0]
IPI0:    161874    364263    283705    493579    129390    126582     64037     57948  Rescheduling interrupts
IPI1:    169821    171101    123464     53741    174590    172057    172061    172728  Function call interrupts
IPI2:         0         0         0         0         0         0         0         0  CPU stop interrupts
IPI3:         0         0         0         0         0         0         0         0  Timer broadcast interrupts
IPI4:     11765      3819     30586     63298     24112     24970      5405      5181  IRQ work interrupts
IPI5:         0         0         0         0         0         0         0         0  CPU wake-up interrupts
 Err:         0

Thanks
Simon

simon.glet · September 10, 2020, 5:17pm

Hi,

I found a way to crash the Jetson AGX Xavier DevKit:
1 - run all executables in : /usr/src/nvidia/graphics_demos/prebuilts/bin/x11
2 - spread them nicely to occupy the whole screen
3 - wait 30 minutes

I was running a couple of utilities from a remote station at the same time, so here are the last logs:

dmesg --follow
[101371.217264] INFO: rcu_sched detected stalls on CPUs/tasks:
[101371.217431] 0-…: (1 GPs behind) idle=bf9/140000000000002/0 softirq=1680728/1680729 fqs=2504
[101371.217592] (detected by 2, t=5252 jiffies, g=74582, c=74581, q=6)
[101371.217723] Task dump for CPU 0:
[101371.217729] nvgpu_channel_p R running task 0 5899 2 0x00000002
[101371.217747] Call trace:
[101371.217781] [] __switch_to+0x9c/0xc0
[101371.217794] [] 0xffffffc7c23c1408
[101371.337270] INFO: rcu_preempt self-detected stall on CPU
[101371.337427] 0-…: (1 GPs behind) idle=bf9/140000000000002/0 softirq=1680704/1680729 fqs=2428
[101371.337580] (t=5251 jiffies g=502986 c=502985 q=9082)
[101371.337686] Task dump for CPU 0:
[101371.337694] nvgpu_channel_p R running task 0 5899 2 0x00000002
[101371.337712] Call trace:
[101371.337741] [] dump_backtrace+0x0/0x198
[101371.337756] [] show_stack+0x24/0x30
[101371.337770] [] sched_show_task+0xf8/0x148
[101371.337782] [] dump_cpu_task+0x48/0x58
[101371.337795] [] rcu_dump_cpu_stacks+0xb8/0xec
[101371.337808] [] rcu_check_callbacks+0x728/0xa48
[101371.337820] [] update_process_times+0x34/0x60
[101371.337834] [] tick_sched_handle.isra.5+0x38/0x70
[101371.337844] [] tick_sched_timer+0x4c/0x90
[101371.337855] [] __hrtimer_run_queues+0xd8/0x360
[101371.337865] [] hrtimer_interrupt+0xa8/0x1e0
[101371.337878] [] arch_timer_handler_phys+0x38/0x58
[101371.337891] [] handle_percpu_devid_irq+0x90/0x2b0
[101371.337902] [] generic_handle_irq+0x34/0x50
[101371.337911] [] __handle_domain_irq+0x68/0xc0
[101371.337922] [] gic_handle_irq+0x5c/0xb0
[101371.337932] [] el1_irq+0xe8/0x194
[101371.337942] [] update_blocked_averages+0x678/0x1f18
[101371.337954] [] rebalance_domains+0x4c/0x2c8
[101371.337964] [] run_rebalance_domains+0x154/0x218
[101371.337974] [] __do_softirq+0x13c/0x3b0
[101371.337987] [] irq_exit+0xd0/0x118
[101371.337997] [] __handle_domain_irq+0x6c/0xc0
top
top - 13:06:30 up 1 day, 4:09, 6 users, load average: 4.89, 3.95, 2.69
Tasks: 390 total, 7 running, 383 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.9 us, 21.8 sy, 0.0 ni, 47.7 id, 1.3 wa, 12.9 hi, 0.5 si, 0.0 st
KiB Mem : 32691652 total, 29709548 free, 2030676 used, 951428 buff/cache
KiB Swap: 16345792 total, 16345792 free, 0 used. 30281260 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16459 simon 20 0 59272 29864 21856 D 77.6 0.1 18:04.17 ctree
15229 root 20 0 24.170g 45668 24548 S 48.5 0.1 14:48.25 Xorg
16497 simon 20 0 46684 25880 20744 R 40.6 0.1 0:54.37 eglstreamcube
15826 simon 20 0 381020 38700 25604 S 25.7 0.1 6:41.34 vino-server
15728 simon 20 0 1159312 137868 67432 S 24.8 0.4 4:03.10 compiz
16456 simon 20 0 52840 31460 21252 R 22.8 0.1 1:11.53 bubble
16523 simon 20 0 47152 25152 20108 R 18.5 0.1 1:03.46 gears
16510 simon 20 0 48484 25928 19784 R 17.5 0.1 1:04.91 gearscube
985 root -51 0 0 0 0 S 12.5 0.0 3:12.99 irq/73-host_syn
5899 root 20 0 0 0 0 R 4.6 0.0 1:42.42 nvgpu_channel_p
2565 root -51 0 0 0 0 S 2.3 0.0 0:42.57 irq/476-gk20a_s
16504 root 20 0 0 0 0 S 2.0 0.0 0:18.61 kworker/u16:5
16723 root 20 0 0 0 0 S 1.7 0.0 0:02.07 kworker/u16:1
3 root 20 0 0 0 0 S 1.0 0.0 0:01.44 ksoftirqd/0
tegrastats
RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,28%@2265,29%@2265,35%@2265,36%@2265,67%@2265,38%@2265,100%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@40C GPU@41.5C Tdiode@42.75C PMIC@100C AUX@40C CPU@43C thermal@41.65C Tboard@39C GPU 3707/3748 CPU 6334/5558 SOC 3400/3584 CV 154/154 VDDRQ 1235/1527 SYS5V 2764/2865
RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,50%@2265,30%@2265,66%@2265,22%@2265,38%@2265,33%@2265,93%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@40C GPU@41.5C Tdiode@42.75C PMIC@100C AUX@40C CPU@42.5C thermal@41.65C Tboard@39C GPU 2783/3747 CPU 5568/5558 SOC 3093/3583 CV 154/154 VDDRQ 1081/1527 SYS5V 2644/2865
RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,100%@2265,8%@2265,2%@2265,2%@2265,100%@2265,0%@2265,0%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@39.5C GPU@41C Tdiode@42.75C PMIC@100C AUX@40C CPU@42C thermal@41.2C Tboard@39C GPU 1084/3743 CPU 4338/5556 SOC 2635/3582 CV 154/154 VDDRQ 619/1526 SYS5V 2443/2864
RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,100%@2265,3%@2265,6%@2265,0%@2265,100%@2265,0%@2265,1%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@39.5C GPU@40.5C Tdiode@42.25C PMIC@100C AUX@39.5C CPU@42C thermal@40.75C Tboard@39C GPU 929/3740 CPU 4185/5555 SOC 2635/3581 CV 155/154 VDDRQ 464/1524 SYS5V 2443/2864
RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,100%@2265,5%@2265,0%@2265,0%@2265,100%@2265,0%@2265,0%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@39C GPU@40.5C Tdiode@42.25C PMIC@100C AUX@39.5C CPU@41.5C thermal@40.75C Tboard@39C GPU 929/3736 CPU 4185/5553 SOC 2635/3580 CV 155/154 VDDRQ 464/1523 SYS5V 2443/2863
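
The RCU stall reports in the dmesg output above name both the stalled CPU and the task that was running on it, which is handy when comparing many crash logs. Below is a sketch that scans a saved log for such reports. It assumes the "Task dump for CPU N:" line is followed directly by the task line, as in this log; the function name and the look-ahead window of a few lines are arbitrary choices.

```python
import re

STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] INFO: (rcu_\w+) (?:detected stalls|self-detected stall)"
)
TASK_RE = re.compile(r"\[\s*\d+\.\d+\] Task dump for CPU (\d+):")


def find_rcu_stalls(log_text):
    """Return (timestamp, rcu flavor, cpu, task) for each RCU stall report."""
    lines = log_text.splitlines()
    stalls = []
    for i, line in enumerate(lines):
        m = STALL_RE.search(line)
        if not m:
            continue
        cpu, task = None, None
        # Look ahead a few lines for the task dump that follows the report.
        for j in range(i + 1, min(i + 8, len(lines) - 1)):
            t = TASK_RE.search(lines[j])
            if t:
                cpu = int(t.group(1))
                # The next line is '[timestamp] <task name> <state> ...'.
                words = lines[j + 1].split("]", 1)[-1].split()
                task = words[0] if words else None
                break
        stalls.append((float(m.group(1)), m.group(2), cpu, task))
    return stalls


sample = """
[101371.217264] INFO: rcu_sched detected stalls on CPUs/tasks:
[101371.217723] Task dump for CPU 0:
[101371.217729] nvgpu_channel_p R running task 0 5899 2 0x00000002
[101371.337270] INFO: rcu_preempt self-detected stall on CPU
[101371.337686] Task dump for CPU 0:
[101371.337694] nvgpu_channel_p R running task 0 5899 2 0x00000002
"""
for stall in find_rcu_stalls(sample):
    print(stall)
```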

At some point the unit froze and has since rebooted over and over…

I will RMA this unit ASAP.

Thanks
Simon
