AGX Xavier power supply: very sensitive to voltage variation

Also want to know …

Is the GPU error always seen before the system shuts down? I noticed that one of your logs shows the GPU error, but it happened 20 minutes before the reboot. In that case it does not look directly connected to the reboot.

You are correct. Sometimes I notice that the system suddenly grinds to a halt but does not crash, and eventually recovers by itself. I think that is what you saw 20 minutes before the reboot.

A typical auto reset looks like this: all of a sudden the system grinds to a halt, and about 1 minute later it reboots automatically.


Is it always “cannot do kernel paging” in the kernel log, with a stack dump showing gk20a, when this error happens?

Since I cannot reproduce this issue (maybe today is just not a bad day), we may need to collect the errors from your side.

Let’s see if it is always the same driver that causes the problem. Also, please try to post all errors from the GPU if possible. So far your logs have always been truncated.

Also, are you using syslog or the log from the serial console? It looks like syslog to me.

I can post more logs. Please let me know which log you need under /var/log
kern.log (922.9 KB)
kern.log.2.gz (341.0 KB)
syslog.2.gz (92.7 KB)
I encountered one auto reset this afternoon. You can find it in the attached kern.log. (Funnily, I cannot upload syslog, syslog.1, or kern.log.1 because their suffixes are not allowed…)

Hi,

You could check the serial console log. But remember to remove the “quiet” keyword in extlinux.conf. Otherwise the kernel log would be silent.

The serial console log is not under /var/log. It even dumps the log from the bootloader.
https://elinux.org/Jetson/General_debug

It may well show the same result as your current logs. I just want to make sure nothing is missing from syslog.

According to the latest log, there is a GPU error in your log again, followed 20 sec later by a kernel panic… But this time there is no kernel paging error. It is a CPU error and it comes from the eqos driver… (Ethernet controller).

Will do (but I may need to dig out how, or you could give me a couple of hints on how to record the serial console log).

So far, what’s your take? Is this GPU error caused by software, or by voltage fluctuation?

I think it is caused by the GPU driver. We will investigate this.

Hi,

Sorry, I just noticed something in your log.
Are you sure the tegrastats result in #5 is the correct one?

The GPU load there is only 2%, which is unlikely to be a case that would cause a GPU error.

Also, your device seems to have been unstable from the beginning. The kernel panic from eqos_napi_poll_rx always seems to be there, but it does not cause the problem 100% of the time.

For example, the one below has the error at 8 pm, but the system only reboots hours later.

Aug 30 20:00:51 agx kernel: [ 685.742381] [] napi_gro_receive+0x15c/0x188
Aug 30 20:00:51 agx kernel: [ 685.742397] [] eqos_napi_poll_rx+0x358/0x430
Aug 30 20:00:51 agx kernel: [ 685.742405] [] net_rx_action+0xf4/0x358
Aug 30 20:00:51 agx kernel: [ 685.742413] [] __do_softirq+0x13c/0x3b0
Aug 30 20:00:51 agx kernel: [ 685.742428] [] irq_exit+0xd0/0x118
Aug 30 20:00:51 agx kernel: [ 685.742436] [] __handle_domain_irq+0x6c/0xc0
Aug 30 20:00:51 agx kernel: [ 685.742443] [] gic_handle_irq+0x5c/0xb0
Aug 30 20:00:51 agx kernel: [ 685.742450] [] el1_irq+0xe8/0x194
Aug 30 20:00:51 agx kernel: [ 685.742465] [] smpboot_thread_fn+0xd4/0x248
Aug 30 20:00:51 agx kernel: [ 685.742473] [] kthread+0xec/0xf0
Aug 30 20:00:51 agx kernel: [ 685.742481] [] ret_from_fork+0x10/0x30
Aug 30 20:00:52 agx kernel: [ 686.469476] INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 30 20:00:52 agx kernel: [ 686.469710] 0-…: (1 GPs behind) idle=543/140000000000002/0 softirq=93458/93460 fqs=2154
Aug 30 20:00:52 agx kernel: [ 686.469863] (detected by 1, t=5252 jiffies, g=13903, c=13902, q=6)
Aug 30 20:00:52 agx kernel: [ 686.469999] Task dump for CPU 0:
Aug 30 20:00:52 agx kernel: [ 686.470009] ksoftirqd/0 S 0 3 2 0x00000002
Aug 30 20:00:52 agx kernel: [ 686.470019] Call trace:
Aug 30 20:00:52 agx kernel: [ 686.470050] [] __switch_to+0x9c/0xc0
Aug 30 20:00:52 agx kernel: [ 686.470055] [<000000000000000e>] 0xe
Aug 31 13:18:53 agx kernel: [ 0.000000] Booting Linux on physical CPU 0x0
Aug 31 13:18:53 agx kernel: [ 0.000000] Linux version 4.9.140-tegra (buildbrain@mobile-u64-3193) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Wed Apr 8 18:15:20 PDT 2020
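
To check how long after the last logged line the system actually booted again, a small script along these lines could be run over kern.log. This is only a sketch: the function names are made up for illustration, and the year is an assumption, since syslog timestamps do not carry one.

```python
from datetime import datetime


def syslog_time(line, year=2020):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    The year is an assumption: syslog lines do not include it.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps_before_boot(lines, year=2020):
    """Seconds between each 'Booting Linux' line and the line just before it."""
    gaps = []
    previous = None
    for line in lines:
        if "Booting Linux on physical CPU" in line and previous is not None:
            delta = syslog_time(line, year) - syslog_time(previous, year)
            gaps.append(delta.total_seconds())
        previous = line
    return gaps


# The last line before the reboot and the first boot line from the log above.
sample = [
    "Aug 30 20:00:52 agx kernel: [ 686.470055] [<000000000000000e>] 0xe",
    "Aug 31 13:18:53 agx kernel: [ 0.000000] Booting Linux on physical CPU 0x0",
]
print(gaps_before_boot(sample))
```

For the two lines above the gap is a little over 17 hours, which is why this trace does not look like the direct cause of the reboot.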

Was this reboot triggered by you manually or by the system?
Could you run sudo tegrastats again with the DS application and wait for the error to occur again? I need to know the tegrastats result right before the system reboots.

Please do confirm the tegrastats result.
We ran the DS sample for another 3 hours, but the tegrastats result is totally different from your case.
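
To compare tegrastats results between the two setups, the GPU load and temperatures can be pulled out of each logged line. A minimal sketch follows, assuming the usual `GR3D_FREQ n%` and `NAME@tempC` fields; the sample line is hypothetical and not taken from the logs in this thread.

```python
import re

# Field layout is an assumption based on typical tegrastats output.
GR3D_RE = re.compile(r"GR3D_FREQ (\d+)%")
TEMP_RE = re.compile(r"\b(\w+)@([\d.]+)C\b")


def parse_tegrastats(line):
    """Return the GPU load (percent) and all temperature readings of one line."""
    gr3d = GR3D_RE.search(line)
    temps = {name: float(value) for name, value in TEMP_RE.findall(line)}
    return {
        "gr3d_percent": int(gr3d.group(1)) if gr3d else None,
        "temps_c": temps,
    }


# Hypothetical, shortened tegrastats line for illustration only.
sample_line = (
    "RAM 2669/15692MB CPU [2%@1190,0%@1190] EMC_FREQ 0% "
    "GR3D_FREQ 2% AO@33C GPU@33.5C thermal@33.85C"
)
print(parse_tegrastats(sample_line))
```

Running this over the last lines of a tegrastats log captured before a reboot would show whether the GPU was really loaded at that moment.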

Hi,
Please refer to this post and the kern.log attached to it. The system rebooted automatically right after the nvgpu_set_error occurred at Aug 27 12:16:24. The associated tegrastats log is here. These two logs are a confirmed case. Please be aware that “the 12 video files have to be different length” to reproduce the problem easily. For example, with four videos of different lengths fed into python3 deepstream_test_3.py, the problem is most likely to happen when the short video has run out and the long video is still running in the same test…

Hi Wayne, while setting up the serial console I captured something very interesting and abnormal: after setting up the serial console for the first time, I went to the AGX Xavier and clicked the power off button. Instead of powering off, the system crashed and rebooted by itself. Lo and behold, I captured the serial console log, attached as console.log (29.8 KB). As you can see, the system went into “Kernel panic” after I clicked “power off”!? The system did not power off; instead it rebooted by itself! Could you tell me what’s wrong with my system? Thanks a lot for your help.

P.S. After that I shut down the system and tried to start it again, but it would not start anymore; it hangs during power up. See this console log: abnormal_start_crash.log (27.8 KB)

Hi ynjiun,

  1. Do you mean you can run into the kernel panic without running any DS application, just by clicking the power button?

  2. Just to remind you again: the current serial console log is not complete because of the quiet keyword in /boot/extlinux/extlinux.conf. Please remove it and reboot.

I think this is device specific, or maybe caused by voltage variation… Do you have another Xavier devkit to verify with? We ran the same sample on our device overnight and still cannot see any error.
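
As a rough illustration of removing the quiet keyword, the sketch below drops it from any APPEND line of an extlinux.conf text. The function name and the sample file content are hypothetical; please check your own file and keep a backup before writing anything back to /boot/extlinux/extlinux.conf.

```python
def remove_quiet(conf_text):
    """Return conf_text with the 'quiet' token dropped from APPEND lines."""
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip()
        if stripped.upper().startswith("APPEND"):
            indent = line[: len(line) - len(stripped)]
            tokens = [t for t in stripped.split() if t != "quiet"]
            line = indent + " ".join(tokens)
        out.append(line)
    trailing = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(out) + trailing


# Hypothetical file content for illustration; a real file may differ.
sample_conf = (
    "LABEL primary\n"
    "      MENU LABEL primary kernel\n"
    "      LINUX /boot/Image\n"
    "      APPEND ${cbootargs} quiet\n"
)
print(remove_quiet(sample_conf))
```

Only whole `quiet` tokens on APPEND lines are touched, so other lines are left unchanged.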

Hi Wayne,

Yes, just clicking power off created the kernel panic, without running any DS application.

After this incident, I tried to change the quiet setting in extlinux.conf and reboot again, but the device never rebooted successfully anymore. Then I reflashed the device for a brand new start, and just now I was able to reproduce the crash while capturing the serial console log (quiet removed) and tegrastats, by running the same 4 pipelines of deepstream_test_3.py fed with 12 videos each (48 videos in total feeding 4 copies of deepstream_test_3.py running at the same time).

The logs are attached: console.log (195.6 KB) tegrastats.log (168 KB)

Ok. Our test only ran x1 process with 12 videos, so it is not the same as your case.

Let us run again today.

Hi,

Could you reproduce the issue with only 30 videos? That is the maximum number we can support on Xavier.
I am also running 30 videos on my side now.

I think I know how to reproduce it more effectively now. Please follow the steps below:

  1. power up in MAXN mode with fan speed 0
  2. start running x4 pipelines, each with 12 videos
  3. after all 4 pipelines are running, change the fan speed to any number (175 or 255, etc.). If you are lucky, you might hit the GPU error the first time; if not, try changing the fan speed a few times

Once it hits the GPU error (typically the nvgpu_set_error_notifier_locked:137 error), the system grinds to a halt, and a few minutes later (sometimes as little as 20 secs) it reboots automatically…

[  307.750522] nvgpu: 17000000.gv11b    gk20a_fifo_handle_pbdma_intr_0:2722 [ERR]  semaphore acquire timeout!
[  307.750753] nvgpu: 17000000.gv11b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 24 for ch 509
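
To collect all nvgpu errors from a console log rather than a truncated excerpt, something like the following could be used. It is a sketch that assumes every nvgpu error line has the same shape as the two lines above; the number after the function name appears to be a source line number rather than an error code, which is also an assumption here.

```python
import re

# Pattern derived only from the two log lines quoted above.
NVGPU_ERR_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+nvgpu:\s+(?P<device>\S+)\s+"
    r"(?P<function>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<message>.*)"
)


def find_nvgpu_errors(lines):
    """Return one dict per nvgpu [ERR] line found in the given log lines."""
    errors = []
    for text in lines:
        match = NVGPU_ERR_RE.search(text)
        if match:
            entry = match.groupdict()
            entry["uptime"] = float(entry["uptime"])
            entry["line"] = int(entry["line"])
            errors.append(entry)
    return errors


log_lines = [
    "[  307.750522] nvgpu: 17000000.gv11b    gk20a_fifo_handle_pbdma_intr_0:2722 [ERR]  semaphore acquire timeout!",
    "[  307.750753] nvgpu: 17000000.gv11b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 24 for ch 509",
]
for err in find_nvgpu_errors(log_lines):
    print(err["uptime"], err["function"], err["message"])
```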

Afterthought EDIT: in retrospect, step 1 with fan speed 0 might be the problem. By the time the 4 pipelines with 12 videos each are running, more than 30 sec may have passed without the fan running, and the GPU temperature might easily go above 43C or even higher. By the time the fan is turned on in step 3, the GPU might already be overheated… Well, just a guess; I wouldn’t want to reproduce this overheating issue ; ))

Hi,

Please use <30 videos afterwards. We don’t support >30 videos.

Also, I think this method is not good. You should not stop the fan when running such a heavily loaded use case…

Did you observe tegrastats when you used this quick method to reproduce the issue?

Hi,

I tried changing the fan speed, but that does not reproduce the issue on our side.

Wayne, thank you for the test.

Since following your suggestion to turn on the fan and keep the number of videos < 30, and with my 600W line conditioner, I have not been able to reproduce it so far (even when I change the fan speed every second just to test it).

In summary, this is what I did to clear up my crash (auto reset) issue:

  1. reflash my device
  2. plug the 65W power supply into a 600W line conditioner (my outlet voltage swing might be too big; when it is < 118V, I see my system crash more frequently)
  3. keep the number of videos in my app’s pipeline < 30
  4. make sure the fan is running at 255 when the thermal temperature is > 35C (I checked the tegrastats log: when the GPU is < 43C, the system seems much more stable…)

Well, I will keep you posted if anything changes (or on any new discovery). Thanks a lot for your help along the way ; ))
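
Items 3 and 4 of this checklist can be written down as a simple pre-flight check. This is only a sketch of the thresholds mentioned above; the function and constant names are made up, and it only inspects values passed in, it does not read or set the fan itself.

```python
MAX_VIDEOS = 30            # keep the number of videos below this
FAN_ON_THRESHOLD_C = 35.0  # above this the fan should run at full speed
FULL_PWM = 255


def fan_pwm_for(thermal_c, current_pwm=0):
    """Return the fan setting the checklist asks for at this temperature."""
    return FULL_PWM if thermal_c > FAN_ON_THRESHOLD_C else current_pwm


def check_setup(num_videos, thermal_c, fan_pwm):
    """Return a list of warnings for settings that break the checklist."""
    warnings = []
    if num_videos >= MAX_VIDEOS:
        warnings.append(f"{num_videos} videos: keep it below {MAX_VIDEOS}")
    if thermal_c > FAN_ON_THRESHOLD_C and fan_pwm < FULL_PWM:
        warnings.append(f"{thermal_c}C with fan at {fan_pwm}: set fan to {FULL_PWM}")
    return warnings


# The earlier failing setup: 48 videos, fan off, GPU already warm.
print(check_setup(num_videos=48, thermal_c=43.0, fan_pwm=0))
```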


Hi Wayne, just for your information: my AGX Xavier still keeps rebooting by itself. See my latest post here.