AGX Xavier easy to crash when ethernet network connected

Yes, you are right.

I’m trying to reproduce this and will report back later.

After a 6-hour test with the network connected, I haven’t hit the kernel panic.
I tested with JP 4.4 on the Xavier devkit.

Hi @alanz

We are not exactly sure what is causing the reboots but it is happening. You might want to have a look at Jetson AGX Xavier self rebooting - #46 by simon.glet.

Thanks
Simon

Hi alanz, I am curious about your reproduction environment:

  1. Is the AGX connected to a monitor (display) or running headless? If headless, how do you connect to the unit: ssh or VNC?
  2. What’s the power mode? MAXN or another?
  3. Did you run “sudo jetson_clocks” before your testing?
  4. Did you run any apps during these 6 hours?
  5. What’s your JP version (“head -1 /etc/nv_tegra_release”)?
  6. What was the GPU temperature during the run?
  7. Did you ever see “INFO: rcu_sched detected stalls on CPUs/tasks: 0” during the 6 hours?

Thanks for this information. It will help pin down the differences between your system and ours.

I’ve attached more self-reboot console logs from last night (it happened constantly): multiple_self_reboot.log (529.7 KB)
When the self-rebooting happens constantly, I notice a few things:

  1. GPU/CPU/thermal temperatures are > 35C (even with no apps running) at a room temperature of 28C.
  2. CPU 1 load is > 98%, almost always at 100%. I don’t know what is running, since the unit is not running any apps.
  3. The unit goes into a mode where it reboots itself every few minutes. I have to shut it down by pulling the plug and leave it overnight (I cannot work on this unit anymore…)

This morning, when I turned on the unit, all CPU/GPU/thermal readings were < 32C and the CPU 1 load was < 10%; everything seems stable and normal.

What does this imply? I have long suspected that this unit is thermally sensitive, but I could never reproduce it or nail it down in a solid way. When the self-rebooting happens, it happens repeatedly, and I need to wait until the next day for it to clear up. Very strange behaviour. (Basically the unit is not usable anymore… ;( )

Hi @ynjiun

I think you are on to something with the temperature.

The default fan setting is quiet, which has a trip temperature of 46C. I changed the setting to cool, which has a trip temperature of 35C, with:
sudo nvpmodel -d cool

Since then, the devkit has been playing YouTube HD full-screen videos non-stop with no issue.

Here is the latest tegrastats:
RAM 2440/31925MB (lfb 6939x4MB) SWAP 0/15963MB (cached 0MB) CPU [31%@2265,27%@2265,22%@2265,24%@2265,31%@2265,38%@2265,36%@2265,43%@2265] EMC_FREQ 0% GR3D_FREQ 28% AO@34C GPU@34.5C Tdiode@36.5C PMIC@100C AUX@34C CPU@36C thermal@34.95C Tboard@34C GPU 619/670 CPU 4183/3586 SOC 2788/2544 CV 154/154 VDDRQ 929/897 SYS5V 2564/2474
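
For anyone who wants to track these readings over time, here is a minimal Python sketch that pulls the per-core CPU load and the temperature fields out of one tegrastats line. The function name `parse_tegrastats` is made up for illustration, and the regular expressions only assume the field layout visible in the line above; other JetPack releases may print a different layout.

```python
import re

def parse_tegrastats(line):
    """Extract per-core CPU load (%) and temperatures (C) from one tegrastats line."""
    result = {"cpu_load": [], "temps_c": {}}
    # Per-core entries look like "31%@2265" inside "CPU [ ... ]".
    cpu = re.search(r"CPU \[([^\]]*)\]", line)
    if cpu:
        for core in cpu.group(1).split(","):
            m = re.match(r"(\d+)%@(\d+)", core.strip())
            if m:  # entries that do not match (e.g. a core reported as off) are skipped
                result["cpu_load"].append(int(m.group(1)))
    # Temperature entries look like "GPU@34.5C" or "thermal@34.95C".
    for name, value in re.findall(r"(\w+)@(\d+(?:\.\d+)?)C", line):
        result["temps_c"][name] = float(value)
    return result

# Sample taken from the tegrastats line quoted above (power-rail fields omitted).
SAMPLE = (
    "RAM 2440/31925MB (lfb 6939x4MB) SWAP 0/15963MB (cached 0MB) "
    "CPU [31%@2265,27%@2265,22%@2265,24%@2265,31%@2265,38%@2265,36%@2265,43%@2265] "
    "EMC_FREQ 0% GR3D_FREQ 28% AO@34C GPU@34.5C Tdiode@36.5C PMIC@100C AUX@34C "
    "CPU@36C thermal@34.95C Tboard@34C"
)

stats = parse_tegrastats(SAMPLE)
print(stats["cpu_load"])
print(stats["temps_c"])
```

Feeding each logged line through this and flagging any core that stays near 100% would make the "CPU load creeping up before a reboot" pattern easier to spot in a long tegrastats log.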

Cheers
Simon

It looks like this is probably fixed here:
https://forums.developer.nvidia.com/t/xavier-with-jp4-2-hangs/72014/8

Hi ynjiun,

For your case, could you give us a summary of how many issues you’ve filed?

It looks like all of them are connected rather than separate issues…

For example, I saw that you also filed the topic below, plus the earlier “power supply” issue. You’ve filed 3 topics, and all of them look the same to me.

As I pointed out in the power supply topic, you always see a kernel panic before the system reboots, and that kernel panic comes from the ethernet driver. That also connects it to this topic.

So please stop filing new topics. We can use this one to track the issue.

Hi Wayne,

The summary is in this post

It links to all the posts I have filed on this issue. All the issues I have filed so far seem to be linked to one symptom (not the root cause): the CPU 1 load inches up over time to 100% or close to it, and then the system crashes.

My guess is that some part of the system keeps firing IRQs and inundates the CPU (which would explain why the load gets higher and higher over time). The suspected parts (could be s/w or h/w) are:

  1. power management: bpmp, etc.
  2. network: eqos, etc.
  3. gpu: nvgpu, etc.
  4. others,
    eventually causing a CPU stall, then kernel panic - not syncing: softlockup

That’s my two cents, but I have no clue what is causing these symptoms. My setup is extremely simple (display + keyboard + mouse + ethernet), with no other sensors. The unit uses the 65W power supply that came with the product, plugged into a 600W line conditioner used exclusively for the AGX Xavier (no other device plugged in). The power mode is MAXN, with “sudo nvpmodel -d cool” to keep the fan running. The system can still self-reboot without any apps running. Yesterday, for example, I turned it on around 9:00am; it rebooted itself around 12:15pm, a 2nd time around 12:45pm (still nothing running), a 3rd time around 1:15pm (still no apps running) and a 4th time around 4:30pm. All the console logs and tegrastats logs can be found in this post

Thank you for following up.

Hey @ynjiun

Now that you have the fan mode set to cool, if you run “dmesg --follow” or “tail -f /var/log/syslog”, do you see something like this:

“[ 3785.834613] FAN rising trip_level:1 cur_temp:35000 trip_temps[2]:53000
[ 5261.947583] FAN cooling trip_level:0 cur_temp:25800 trip_temps[1]:35000
[ 7076.304136] FAN rising trip_level:1 cur_temp:35000 trip_temps[2]:53000
[31135.117898] FAN cooling trip_level:0 cur_temp:25800 trip_temps[1]:35000”
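
If it helps, here is a rough Python sketch for pulling these FAN lines out of a saved dmesg or syslog capture. The regular expression is written only against the four lines quoted above, the name `parse_fan_events` is made up, and the conversion assumes cur_temp and trip_temps are in millidegrees Celsius (35000 = 35C), which matches the 35C trip temperature of the cool mode mentioned earlier but is still an assumption.

```python
import re

# Pattern written against the log lines quoted above; other kernels may differ.
FAN_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+FAN\s+(?P<direction>rising|cooling)\s+"
    r"trip_level:(?P<level>\d+)\s+cur_temp:(?P<cur>\d+)\s+"
    r"trip_temps\[(?P<idx>\d+)\]:(?P<trip>\d+)"
)

def parse_fan_events(log_text):
    """Return one dict per FAN trip message found in log_text."""
    events = []
    for m in FAN_RE.finditer(log_text):
        events.append({
            "uptime_s": float(m.group("ts")),
            "direction": m.group("direction"),
            "trip_level": int(m.group("level")),
            # Assumption: values are millidegrees Celsius.
            "cur_temp_c": int(m.group("cur")) / 1000.0,
            "trip_temp_c": int(m.group("trip")) / 1000.0,
        })
    return events

SAMPLE_LOG = """\
[ 3785.834613] FAN rising trip_level:1 cur_temp:35000 trip_temps[2]:53000
[ 5261.947583] FAN cooling trip_level:0 cur_temp:25800 trip_temps[1]:35000
"""

for event in parse_fan_events(SAMPLE_LOG):
    print(event)
```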

Cheers
Simon

Hi ynjiun,

Could you check /proc/interrupts and see if there are any abnormal interrupts?

Actually, I think it might be good to RMA this device. Do you see the eqos issue if you just run “stress” to push the CPU load?

Hi @WayneWWW,

As I have the same board/software version and issues as @ynjiun, here are the results of /proc/interrupts:

        CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7

2: 0 0 0 0 0 0 0 0 GICv2 29 Level trusty
3: 3139135 976517 408455 378605 1450401 1813446 1221013 2805618 GICv2 30 Level arch_timer
6: 1521184 0 0 0 0 0 0 0 GICv2 208 Level hsp
7: 0 0 0 0 0 0 0 0 GICv2 202 Level arm-smmu global fault
8: 0 0 0 0 0 0 0 0 GICv2 203 Level arm-smmu global fault
9: 0 0 0 0 0 0 0 0 GICv2 264 Level arm-smmu global fault
10: 0 0 0 0 0 0 0 0 GICv2 265 Level arm-smmu global fault
11: 0 0 0 0 0 0 0 0 GICv2 272 Level arm-smmu global fault
12: 0 0 0 0 0 0 0 0 GICv2 273 Level arm-smmu global fault
13: 0 0 0 0 0 0 0 0 GICv2 368 Level tegra-p2u-intr
14: 0 0 0 0 0 0 0 0 GICv2 369 Level tegra-p2u-intr
15: 0 0 0 0 0 0 0 0 GICv2 370 Level tegra-p2u-intr
16: 0 0 0 0 0 0 0 0 GICv2 371 Level tegra-p2u-intr
17: 0 0 0 0 0 0 0 0 GICv2 372 Level tegra-p2u-intr
18: 0 0 0 0 0 0 0 0 GICv2 373 Level tegra-p2u-intr
19: 0 0 0 0 0 0 0 0 GICv2 374 Level tegra-p2u-intr
20: 0 0 0 0 0 0 0 0 GICv2 375 Level tegra-p2u-intr
21: 0 0 0 0 0 0 0 0 GICv2 376 Level tegra-p2u-intr
22: 0 0 0 0 0 0 0 0 GICv2 377 Level tegra-p2u-intr
23: 0 0 0 0 0 0 0 0 GICv2 253 Level tegra-p2u-intr
24: 0 0 0 0 0 0 0 0 GICv2 254 Level tegra-p2u-intr
25: 0 0 0 0 0 0 0 0 GICv2 378 Level tegra-p2u-intr
26: 0 0 0 0 0 0 0 0 GICv2 379 Level tegra-p2u-intr
27: 0 0 0 0 0 0 0 0 GICv2 380 Level tegra-p2u-intr
28: 0 0 0 0 0 0 0 0 GICv2 381 Level tegra-p2u-intr
29: 0 0 0 0 0 0 0 0 GICv2 382 Level tegra-p2u-intr
30: 0 0 0 0 0 0 0 0 GICv2 383 Level tegra-p2u-intr
31: 0 0 0 0 0 0 0 0 GICv2 235 Level tegra-p2u-intr
32: 0 0 0 0 0 0 0 0 GICv2 252 Level tegra-p2u-intr
33: 0 0 0 0 0 0 0 0 GICv2 104 Level tegra-pcie-intr
34: 0 0 0 0 0 0 0 0 GICv2 105 Level tegra-pcie-msi
35: 1 0 0 0 0 0 0 0 GICv2 77 Level tegra-pcie-intr, PCIe PME, aerdrv
36: 0 0 0 0 0 0 0 0 GICv2 78 Level tegra-pcie-msi
37: 0 0 0 0 0 0 0 0 GICv2 81 Level tegra-pcie-intr
38: 0 0 0 0 0 0 0 0 GICv2 82 Level tegra-pcie-msi
39: 0 0 0 0 0 0 0 0 GICv2 85 Level tegra-pcie-intr
40: 0 0 0 0 0 0 0 0 GICv2 86 Level tegra-pcie-msi
41: 1 0 0 0 0 0 0 0 GICv2 226 Level ether_qos.common_irq
43: 224191 0 0 0 0 0 0 0 GICv2 222 Level 2490000.ether_qos.rx0
44: 206187 0 0 0 0 0 0 0 GICv2 218 Level 2490000.ether_qos.tx0
51: 28 0 0 0 0 0 0 0 GICv2 144 Level 3100000.serial
54: 0 0 0 0 0 0 0 0 GICv2 152 Level combined_uart rx
55: 20123 0 0 0 0 0 0 0 GICv2 97 Level mmc0
56: 0 0 0 0 0 0 0 0 GICv2 94 Level mmc1
57: 0 0 0 0 0 0 0 0 GICv2 76 Level ufshcd
58: 0 0 0 0 0 0 0 0 GICv2 68 Level 3210000.spi
59: 0 0 0 0 0 0 0 0 GICv2 69 Level c260000.spi
60: 0 0 0 0 0 0 0 0 GICv2 57 Level 3160000.i2c
61: 2186762 0 0 0 0 0 0 0 GICv2 58 Level c240000.i2c
62: 0 0 0 0 0 0 0 0 GICv2 59 Level 3180000.i2c
63: 0 0 0 0 0 0 0 0 GICv2 60 Level 3190000.i2c
64: 0 0 0 0 0 0 0 0 GICv2 62 Level 31b0000.i2c
65: 3 0 0 0 0 0 0 0 GICv2 63 Level 31c0000.i2c
66: 529 0 0 0 0 0 0 0 GICv2 64 Level c250000.i2c
67: 0 0 0 0 0 0 0 0 GICv2 65 Level 31e0000.i2c
70: 357 0 0 0 0 0 0 0 GICv2 193 Level snd_hda_tegra
71: 0 0 0 0 0 0 0 0 GICv2 51 Level bc00000.rtcpu
72: 115 0 0 0 0 0 0 0 GICv2 242 Level d230000.actmon
73: 14833 0 0 0 0 0 0 0 GICv2 297 Level host_syncpt
74: 2 0 0 0 0 0 0 0 GICv2 295 Level host_status
75: 0 0 0 0 0 0 0 0 GICv2 238 Level vic
76: 0 0 0 0 0 0 0 0 GICv2 268 Level nvdla0
77: 0 0 0 0 0 0 0 0 GICv2 269 Level nvdla1
78: 19949 0 0 0 0 0 0 0 GICv2 185 Level 15200000.nvdisplay
79: 0 0 0 0 0 0 0 0 GICv2 186 Level 15210000.nvdisplay
80: 0 0 0 0 0 0 0 0 GICv2 187 Level 15220000.nvdisplay
81: 0 0 0 0 0 0 0 0 GICv2 191 Level tegra_dp
82: 0 0 0 0 0 0 0 0 GICv2 192 Level tegra_dp
85: 4 0 0 0 0 0 0 0 GICv2 194 Level cec_irq
87: 0 0 0 0 0 0 0 0 GICv2 266 Level pva-isr
88: 0 0 0 0 0 0 0 0 GICv2 267 Level pva-isr
97: 0 0 0 0 0 0 0 0 GICv2 397 Level carmel-pmu
98: 0 0 0 0 0 0 0 0 GICv2 270 Level noc_nonsecure_irq
99: 0 0 0 0 0 0 0 0 GICv2 271 Level noc_secure_irq
100: 0 0 0 0 0 0 0 0 PM 42 Level tegra_rtc
101: 0 0 0 0 0 0 0 0 GICv2 255 Level mc_status
103: 4 0 0 0 0 0 0 0 GICv2 165 Level c150000.tegra-hsp
114: 31 0 0 0 0 0 0 0 GICv2 214 Level b950000.tegra-hsp, b950000.tegra-hsp, b950000.tegra-hsp
118: 0 0 0 0 0 0 0 0 GICv2 315 Level 3ad0000.se_elp
120: 0 0 0 0 0 0 0 0 GICv2 108 Level gpcdma.0
121: 0 0 0 0 0 0 0 0 GICv2 109 Level gpcdma.1
122: 0 0 0 0 0 0 0 0 GICv2 110 Level gpcdma.2
123: 0 0 0 0 0 0 0 0 GICv2 111 Level gpcdma.3
124: 0 0 0 0 0 0 0 0 GICv2 112 Level gpcdma.4
125: 0 0 0 0 0 0 0 0 GICv2 113 Level gpcdma.5
126: 3 0 0 0 0 0 0 0 GICv2 114 Level gpcdma.6
127: 3 0 0 0 0 0 0 0 GICv2 115 Level gpcdma.7
128: 0 0 0 0 0 0 0 0 GICv2 116 Level gpcdma.8
129: 0 0 0 0 0 0 0 0 GICv2 117 Level gpcdma.9
130: 0 0 0 0 0 0 0 0 GICv2 118 Level gpcdma.10
131: 0 0 0 0 0 0 0 0 GICv2 119 Level gpcdma.11
132: 0 0 0 0 0 0 0 0 GICv2 120 Level gpcdma.12
133: 0 0 0 0 0 0 0 0 GICv2 121 Level gpcdma.13
134: 0 0 0 0 0 0 0 0 GICv2 122 Level gpcdma.14
135: 0 0 0 0 0 0 0 0 GICv2 123 Level gpcdma.15
136: 0 0 0 0 0 0 0 0 GICv2 124 Level gpcdma.16
137: 0 0 0 0 0 0 0 0 GICv2 125 Level gpcdma.17
138: 0 0 0 0 0 0 0 0 GICv2 126 Level gpcdma.18
139: 0 0 0 0 0 0 0 0 GICv2 127 Level gpcdma.19
140: 0 0 0 0 0 0 0 0 GICv2 128 Level gpcdma.20
141: 0 0 0 0 0 0 0 0 GICv2 129 Level gpcdma.21
142: 0 0 0 0 0 0 0 0 GICv2 130 Level gpcdma.22
143: 0 0 0 0 0 0 0 0 GICv2 131 Level gpcdma.23
144: 0 0 0 0 0 0 0 0 GICv2 132 Level gpcdma.24
145: 0 0 0 0 0 0 0 0 GICv2 133 Level gpcdma.25
146: 0 0 0 0 0 0 0 0 GICv2 134 Level gpcdma.26
147: 0 0 0 0 0 0 0 0 GICv2 135 Level gpcdma.27
148: 0 0 0 0 0 0 0 0 GICv2 136 Level gpcdma.28
149: 0 0 0 0 0 0 0 0 GICv2 137 Level gpcdma.29
150: 0 0 0 0 0 0 0 0 GICv2 138 Level gpcdma.30
248: 0 0 0 0 0 0 0 0 tegra-gpio 48 Edge force-recovery
252: 2 0 0 0 0 0 0 0 tegra-gpio 52 Level phy_interrupt
255: 0 0 0 0 0 0 0 0 tegra-gpio 55 Edge 3400000.sdhci cd
258: 0 0 0 0 0 0 0 0 tegra-gpio 58 Level tmp451
298: 1 0 0 0 0 0 0 0 tegra-gpio 98 Edge 15200000.nvdisplay
349: 2 0 0 0 0 0 0 0 tegra-gpio 149 Edge rt5659
392: 857416 0 0 0 0 0 0 0 tegra-gpio 192 Edge bluetooth hostwake
438: 17 0 0 0 0 0 0 0 tegra-gpio-aon 10 Level ccg_irq
460: 0 0 0 0 0 0 0 0 tegra-gpio-aon 32 Edge ufs_cd_gpio
464: 0 0 0 0 0 0 0 0 tegra-gpio-aon 36 Edge power-key
468: 3321 0 0 0 0 0 0 0 GICv2 39 Level 30c0000.watchdog
472: 0 0 0 0 0 0 0 0 GICv2 198 Level 3550000.xudc
473: 48 0 0 0 0 0 0 0 PM 195 Level xhci-hcd:usb1
474: 1 0 0 0 0 0 0 0 PM 196 Level 3610000.xhci
475: 0 0 0 0 0 0 0 0 PM 199 Level 3610000.xhci
476: 10999 0 0 0 0 0 0 0 GICv2 102 Level gk20a_stall
477: 0 0 0 0 0 0 0 0 GICv2 103 Level gk20a_nonstall
478: 0 0 0 0 0 0 0 0 GICv2 424 Level ras-fhi
479: 0 0 0 0 0 0 0 0 GICv2 425 Level ras-fhi
480: 0 0 0 0 0 0 0 0 GICv2 426 Level ras-fhi
481: 0 0 0 0 0 0 0 0 GICv2 427 Level ras-fhi
482: 0 0 0 0 0 0 0 0 GICv2 428 Level ras-fhi
483: 0 0 0 0 0 0 0 0 GICv2 429 Level ras-fhi
484: 0 0 0 0 0 0 0 0 GICv2 430 Level ras-fhi
485: 0 0 0 0 0 0 0 0 GICv2 431 Level ras-fhi
486: 0 0 0 0 0 0 0 0 GICv2 262 Level noc_nonsecure_irq
487: 0 0 0 0 0 0 0 0 GICv2 263 Level noc_secure_irq
488: 0 0 0 0 0 0 0 0 GICv2 292 Level noc_nonsecure_irq
489: 0 0 0 0 0 0 0 0 GICv2 204 Level noc_secure_irq
490: 0 0 0 0 0 0 0 0 GICv2 294 Level noc_nonsecure_irq
491: 0 0 0 0 0 0 0 0 GICv2 206 Level noc_secure_irq
492: 0 0 0 0 0 0 0 0 GICv2 291 Level noc_nonsecure_irq
493: 0 0 0 0 0 0 0 0 GICv2 207 Level noc_secure_irq
494: 0 0 0 0 0 0 0 0 GICv2 293 Level noc_nonsecure_irq
495: 0 0 0 0 0 0 0 0 GICv2 205 Level noc_secure_irq
497: 0 0 0 0 0 0 0 0 PM 241 Edge max77620-top
501: 0 0 0 0 0 0 0 0 max77620-top 3 Edge max77620-gpio
502: 0 0 0 0 0 0 0 0 max77620-top 4 Edge max77686-rtc
506: 0 0 0 0 0 0 0 0 max77620-top 8 Edge max77620-thermal
507: 0 0 0 0 0 0 0 0 max77620-top 9 Edge max77620-thermal
530: 56 0 0 0 0 0 0 0 agic-controller 32 Level
531: 54 0 0 0 0 0 0 0 agic-controller 33 Level
562: 0 0 0 0 0 0 0 0 max77686-rtc 1 Edge rtc-alarm1
564: 0 0 0 0 0 0 0 0 PCI-MSI 0 Edge ahci[0001:01:00.0]
IPI0: 161874 364263 283705 493579 129390 126582 64037 57948 Rescheduling interrupts
IPI1: 169821 171101 123464 53741 174590 172057 172061 172728 Function call interrupts
IPI2: 0 0 0 0 0 0 0 0 CPU stop interrupts
IPI3: 0 0 0 0 0 0 0 0 Timer broadcast interrupts
IPI4: 11765 3819 30586 63298 24112 24970 5405 5181 IRQ work interrupts
IPI5: 0 0 0 0 0 0 0 0 CPU wake-up interrupts
Err: 0
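
Since a dump this long is hard to read by eye, here is a small Python sketch that ranks the rows by total count and, given two snapshots, shows how much each counter grew in between. The helper names (`parse_interrupts`, `busiest`, `delta`) are made up for illustration, and the sample rows are copied from the dump above.

```python
def parse_interrupts(text):
    """Return {irq_label: (per_cpu_counts, description)} from /proc/interrupts text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = fields[:n_cpus]
        if len(counts) < n_cpus or not all(c.isdigit() for c in counts):
            continue  # e.g. the "Err:" row carries a single counter
        table[label.strip()] = ([int(c) for c in counts], " ".join(fields[n_cpus:]))
    return table

def busiest(table, top=5):
    """Rows with the highest total count across all CPUs."""
    ranked = sorted(table.items(), key=lambda kv: sum(kv[1][0]), reverse=True)
    return [(label, sum(counts), desc) for label, (counts, desc) in ranked[:top]]

def delta(before, after):
    """Growth of each counter between two snapshots (the counts are cumulative)."""
    return {k: sum(after[k][0]) - sum(before[k][0]) for k in after if k in before}

SAMPLE = """\
        CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
3: 3139135 976517 408455 378605 1450401 1813446 1221013 2805618 GICv2 30 Level arch_timer
6: 1521184 0 0 0 0 0 0 0 GICv2 208 Level hsp
43: 224191 0 0 0 0 0 0 0 GICv2 222 Level 2490000.ether_qos.rx0
61: 2186762 0 0 0 0 0 0 0 GICv2 58 Level c240000.i2c
392: 857416 0 0 0 0 0 0 0 tegra-gpio 192 Edge bluetooth hostwake
Err: 0
"""

table = parse_interrupts(SAMPLE)
for label, total, desc in busiest(table, top=3):
    print(label, total, desc)
```

Because the counters are cumulative since boot, a single dump cannot show an interrupt storm on its own; taking two snapshots a minute apart and passing them to `delta` gives a rate, which is closer to what "abnormal" would mean here.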

Thanks
Simon

Hi,

I found a way to crash the Jetson AGX Xavier DevKit:
1 - run all executables in : /usr/src/nvidia/graphics_demos/prebuilts/bin/x11
2 - spread them nicely to occupy the whole screen
3 - wait 30 minutes

I was running a couple of utilities from a remote station at the same time, so here is the last output they logged:

  • dmesg --follow
    [101371.217264] INFO: rcu_sched detected stalls on CPUs/tasks:
    [101371.217431] 0-…: (1 GPs behind) idle=bf9/140000000000002/0 softirq=1680728/1680729 fqs=2504
    [101371.217592] (detected by 2, t=5252 jiffies, g=74582, c=74581, q=6)
    [101371.217723] Task dump for CPU 0:
    [101371.217729] nvgpu_channel_p R running task 0 5899 2 0x00000002
    [101371.217747] Call trace:
    [101371.217781] [] __switch_to+0x9c/0xc0
    [101371.217794] [] 0xffffffc7c23c1408
    [101371.337270] INFO: rcu_preempt self-detected stall on CPU
    [101371.337427] 0-…: (1 GPs behind) idle=bf9/140000000000002/0 softirq=1680704/1680729 fqs=2428
    [101371.337580] (t=5251 jiffies g=502986 c=502985 q=9082)
    [101371.337686] Task dump for CPU 0:
    [101371.337694] nvgpu_channel_p R running task 0 5899 2 0x00000002
    [101371.337712] Call trace:
    [101371.337741] [] dump_backtrace+0x0/0x198
    [101371.337756] [] show_stack+0x24/0x30
    [101371.337770] [] sched_show_task+0xf8/0x148
    [101371.337782] [] dump_cpu_task+0x48/0x58
    [101371.337795] [] rcu_dump_cpu_stacks+0xb8/0xec
    [101371.337808] [] rcu_check_callbacks+0x728/0xa48
    [101371.337820] [] update_process_times+0x34/0x60
    [101371.337834] [] tick_sched_handle.isra.5+0x38/0x70
    [101371.337844] [] tick_sched_timer+0x4c/0x90
    [101371.337855] [] __hrtimer_run_queues+0xd8/0x360
    [101371.337865] [] hrtimer_interrupt+0xa8/0x1e0
    [101371.337878] [] arch_timer_handler_phys+0x38/0x58
    [101371.337891] [] handle_percpu_devid_irq+0x90/0x2b0
    [101371.337902] [] generic_handle_irq+0x34/0x50
    [101371.337911] [] __handle_domain_irq+0x68/0xc0
    [101371.337922] [] gic_handle_irq+0x5c/0xb0
    [101371.337932] [] el1_irq+0xe8/0x194
    [101371.337942] [] update_blocked_averages+0x678/0x1f18
    [101371.337954] [] rebalance_domains+0x4c/0x2c8
    [101371.337964] [] run_rebalance_domains+0x154/0x218
    [101371.337974] [] __do_softirq+0x13c/0x3b0
    [101371.337987] [] irq_exit+0xd0/0x118
    [101371.337997] [] __handle_domain_irq+0x6c/0xc0

  • top
    top - 13:06:30 up 1 day, 4:09, 6 users, load average: 4.89, 3.95, 2.69
    Tasks: 390 total, 7 running, 383 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 15.9 us, 21.8 sy, 0.0 ni, 47.7 id, 1.3 wa, 12.9 hi, 0.5 si, 0.0 st
    KiB Mem : 32691652 total, 29709548 free, 2030676 used, 951428 buff/cache
    KiB Swap: 16345792 total, 16345792 free, 0 used. 30281260 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    16459 simon 20 0 59272 29864 21856 D 77.6 0.1 18:04.17 ctree
    15229 root 20 0 24.170g 45668 24548 S 48.5 0.1 14:48.25 Xorg
    16497 simon 20 0 46684 25880 20744 R 40.6 0.1 0:54.37 eglstreamcube
    15826 simon 20 0 381020 38700 25604 S 25.7 0.1 6:41.34 vino-server
    15728 simon 20 0 1159312 137868 67432 S 24.8 0.4 4:03.10 compiz
    16456 simon 20 0 52840 31460 21252 R 22.8 0.1 1:11.53 bubble
    16523 simon 20 0 47152 25152 20108 R 18.5 0.1 1:03.46 gears
    16510 simon 20 0 48484 25928 19784 R 17.5 0.1 1:04.91 gearscube
    985 root -51 0 0 0 0 S 12.5 0.0 3:12.99 irq/73-host_syn
    5899 root 20 0 0 0 0 R 4.6 0.0 1:42.42 nvgpu_channel_p
    2565 root -51 0 0 0 0 S 2.3 0.0 0:42.57 irq/476-gk20a_s
    16504 root 20 0 0 0 0 S 2.0 0.0 0:18.61 kworker/u16:5
    16723 root 20 0 0 0 0 S 1.7 0.0 0:02.07 kworker/u16:1
    3 root 20 0 0 0 0 S 1.0 0.0 0:01.44 ksoftirqd/0

  • tegrastats
    RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,28%@2265,29%@2265,35%@2265,36%@2265,67%@2265,38%@2265,100%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@40C GPU@41.5C Tdiode@42.75C PMIC@100C AUX@40C CPU@43C thermal@41.65C Tboard@39C GPU 3707/3748 CPU 6334/5558 SOC 3400/3584 CV 154/154 VDDRQ 1235/1527 SYS5V 2764/2865
    RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,50%@2265,30%@2265,66%@2265,22%@2265,38%@2265,33%@2265,93%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@40C GPU@41.5C Tdiode@42.75C PMIC@100C AUX@40C CPU@42.5C thermal@41.65C Tboard@39C GPU 2783/3747 CPU 5568/5558 SOC 3093/3583 CV 154/154 VDDRQ 1081/1527 SYS5V 2644/2865
    RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,100%@2265,8%@2265,2%@2265,2%@2265,100%@2265,0%@2265,0%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@39.5C GPU@41C Tdiode@42.75C PMIC@100C AUX@40C CPU@42C thermal@41.2C Tboard@39C GPU 1084/3743 CPU 4338/5556 SOC 2635/3582 CV 154/154 VDDRQ 619/1526 SYS5V 2443/2864
    RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,100%@2265,3%@2265,6%@2265,0%@2265,100%@2265,0%@2265,1%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@39.5C GPU@40.5C Tdiode@42.25C PMIC@100C AUX@39.5C CPU@42C thermal@40.75C Tboard@39C GPU 929/3740 CPU 4185/5555 SOC 2635/3581 CV 155/154 VDDRQ 464/1524 SYS5V 2443/2864
    RAM 2072/31925MB (lfb 7206x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@2265,100%@2265,5%@2265,0%@2265,0%@2265,100%@2265,0%@2265,0%@2265] EMC_FREQ 0% GR3D_FREQ 0% AO@39C GPU@40.5C Tdiode@42.25C PMIC@100C AUX@39.5C CPU@41.5C thermal@40.75C Tboard@39C GPU 929/3736 CPU 4185/5553 SOC 2635/3580 CV 155/154 VDDRQ 464/1523 SYS5V 2443/2863
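
To check a long kern.log for the same kind of event, here is a rough Python sketch that looks for the two RCU stall messages shown in the dmesg output above and records which task was on the stalled CPU. The patterns are written only against that output, and the name `find_rcu_stalls` is made up for illustration.

```python
import re

# Matches the two stall messages seen in the dmesg output above.
STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+INFO: (rcu_\w+) "
    r"(?:detected stalls on CPUs/tasks|self-detected stall on CPU)"
)
TASK_DUMP_RE = re.compile(r"Task dump for CPU (\d+):")
TIMESTAMP_RE = re.compile(r"^\s*\[\s*\d+\.\d+\]\s*")

def find_rcu_stalls(log_text):
    """Return one dict per RCU stall report, with the CPU and task from its task dump."""
    stalls = []
    current = None
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        m = STALL_RE.search(line)
        if m:
            current = {"uptime_s": float(m.group(1)), "flavor": m.group(2),
                       "cpu": None, "task": None}
            stalls.append(current)
            continue
        d = TASK_DUMP_RE.search(line)
        if d and current is not None and current["cpu"] is None:
            current["cpu"] = int(d.group(1))
            if i + 1 < len(lines):
                # The line after "Task dump" starts with the task name.
                parts = TIMESTAMP_RE.sub("", lines[i + 1]).split()
                current["task"] = parts[0] if parts else None
    return stalls

SAMPLE_LOG = """\
[101371.217264] INFO: rcu_sched detected stalls on CPUs/tasks:
[101371.217723] Task dump for CPU 0:
[101371.217729] nvgpu_channel_p R running task 0 5899 2 0x00000002
[101371.337270] INFO: rcu_preempt self-detected stall on CPU
[101371.337686] Task dump for CPU 0:
[101371.337694] nvgpu_channel_p R running task 0 5899 2 0x00000002
"""

for stall in find_rcu_stalls(SAMPLE_LOG):
    print(stall)
```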

At some point the unit froze and has since rebooted over and over…

I will RMA this unit ASAP.

Thanks
Simon

Hi Wayne,

Good advice. Actually I already RMA’d the unit yesterday and am waiting for the replacement. Once I get it, I will test it to see whether any of the previous symptoms are still there.

Hi @ynjiun and @WayneWWW,

I RMA’d my unit and the replacement works great!

Thanks
Simon

Great to hear that. Just curious, what’s your nv_tegra_release?

cat /etc/nv_tegra_release

Hi @ynjiun,

The version is exactly the same as before:
R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020
and the JetPack version is also the same, 4.4.
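
When comparing units, it can be handy to reduce that release line to a short version string. Here is a minimal Python sketch; the function name `parse_l4t_release` is made up, and the pattern only assumes the "R32 (release), REVISION: 4.3" layout shown above.

```python
import re

def parse_l4t_release(first_line):
    """Turn 'R32 (release), REVISION: 4.3, ...' into '32.4.3'; None if not recognised."""
    m = re.search(r"R(\d+) \(release\), REVISION: (\d+(?:\.\d+)*)", first_line)
    if not m:
        return None
    return f"{m.group(1)}.{m.group(2)}"

# Line as quoted above.
SAMPLE = ("R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, "
          "EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020")

print(parse_l4t_release(SAMPLE))
```

On a real unit the input would be the first line of /etc/nv_tegra_release, i.e. what “head -1 /etc/nv_tegra_release” prints.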

Cheers
Simon

Hi @simon.glet and @WayneWWW,

I received my replacement unit yesterday. Since then, I have already run it for more than 20 hours with NoMachine connected (internet test) and 4X deepstream_test_3.py with 12 videos each, at a GPU/CPU temperature of around 52C, without the 600W line conditioner (just plugged into a regular UPS). No hiccups, no self-reboots any more!

It is now confirmed that my previous unit was faulty.

Thank you for your help.

Cheers

Hi @ynjiun @simon.glet

I am using a Jetson AGX Xavier and reading 6 cameras at the same time. The symptoms are similar to the ones in this topic.

Steps to reproduce:

  • Power on the Jetson Xavier and read 6 cameras at the same time.
  • Push the CPU load up (all 8 cores at 100%, using 8 simple bash loops: ‘while true; do true; done’).
  • The system reboots. (This also relates to another topic of @ynjiun. On a bad day it reboots after 3-30 minutes, but on a good day it can take two or three hours.)

From @ynjiun’s topics, I understand that the issue occurs because the ethernet network is connected. Is that correct? And which replacement unit did you receive, @ynjiun? You wrote:

I received my replacement unit yesterday. Since then, I already run more than 20+ hours with NoMachine connected (internet test),

syslog (3.1 MB)
kern.log (3.0 MB)

Please refer to my logs for more detail.

Thank you very much for your help.
best regards,
Loc Hoang

Hi @v.lochd,

Even after returning the original unit, I still have the same annoying issue.

There seems to be a firmware bug that you could fix by reading this: AGX Xavier freeze in MAXN mode - #33 by WayneWWW

If you manage to apply it successfully, please let us know.

Thanks
Simon

Hi @simon.glet

Thanks for your help. I applied the patches successfully and ran overnight with all 8 cores at 100% CPU, and there was no reboot.

We will keep testing for two or three more days to make sure the issue is actually fixed.

Will let you know the result.

Best regards,
Loc Hoang