AGX Xavier power supply: very sensitive to voltage variation

I am running an AGX Xavier for DeepStream 5.0 application development on JetPack 4.4. I am using the 65W power supply that came with the product, plugged into a UPS/wall outlet. I am experiencing good days and bad days, and I suspect the voltage swing of my mains power is causing it:
Good day: the system never reboots on its own and is very stable (I notice my UPS output voltage is >= 119V)
Bad day: the system frequently reboots on its own for no apparent reason, with a lot of hiccups when running the DeepStream pipeline (I notice my UPS output voltage reads 118V; perhaps the input power from the wall is low? On a hot day a lot of air conditioners are on…)

Question: is there a recommended “solution” I can purchase to regulate the voltage feeding my 65W power supply to a steady 120V, regardless of the swing at my wall plug? Or just a better power supply than the one that came with the product? Where can I purchase it?

Thanks a lot for your help.

Hi, do you have log info from when the reboot happens? What is “tegrastats” showing at the time of the reboot? Is it caused by high temperature or by the capacity of the power supply? You can read the temperatures of the parts in the system to identify whether it is thermal. Checking the power supply voltage drop with an oscilloscope will also help confirm whether the power supply is the cause.

Thank you Trumany for your information.

Today is a Good Day (so far so good, no auto reset, and my UPS output indicates >= 119V), so I am unable to reproduce it today. But to answer your questions:

  1. I don’t have a tegrastats log yet; I will set it up to log on a Bad Day.
  2. However, if my memory serves me well, the system temperature rarely goes beyond 40 deg C on a Good Day, and that is not a problem. On a Bad Day, the temperature can be in the low 30s deg C and the system still auto resets frequently. So far I have not seen a direct correlation between thermal and auto reset.
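For the logging in item 1, a minimal sketch in Python, assuming tegrastats is on the PATH and accepts --interval in milliseconds. The function name and log file name are made up for illustration. Each line gets a wall-clock timestamp and is synced to disk immediately, so the last entries before a reset are less likely to be lost.

```python
import os
import subprocess
import time


def log_with_timestamps(cmd, log_path):
    """Run cmd and append each output line to log_path with a timestamp."""
    with open(log_path, "a") as log:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {line.rstrip()}\n")
            # Push every line to disk right away so the final entries
            # survive an unexpected reset.
            log.flush()
            os.fsync(log.fileno())
        return proc.wait()


# Hypothetical usage on the Jetson (not run here):
# log_with_timestamps(["tegrastats", "--interval", "1000"],
#                     "tegrastats_timestamped.log")
```

The timestamps make it possible to line the readings up against kern.log after a reboot.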

Will keep you posted when a Bad Day arrives ;(

It’s noon, the UPS output voltage has dropped to 118V, and my AGX Xavier auto reset again. This time I was able to capture the tegrastats log: tegrastatslog.txt (114.6 KB)

The last entry of the log doesn’t show anything special:

RAM 6302/31919MB (lfb 5669x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,30%@2265,58%@2265,38%@2265,35%@2265,69%@2265,35%@2265,37%@2265] EMC_FREQ 0% GR3D_FREQ 2% AO@36.5C GPU@39C Tdiode@40.5C PMIC@100C AUX@36C CPU@38.5C thermal@37.95C Tboard@36C GPU 3710/6514 CPU 5100/4614 SOC 3554/4444 CV 0/0 VDDRQ 928/1418 SYS5V 2929/3074

But before it auto reset, I did notice that the system stopped responding and the application ran with a lot of hiccups, slowing down to a few frames per second about 1 to 2 minutes before the auto reset. I checked the log and found that the highest temperatures were around GPU@42.5C, CPU@41.5C, Tdiode@42.75C and thermal@39.75C, about 1 to 2 minutes before the auto reset. Is this level of temperature normal? What would the thermal limit be (above ??C the system will auto reset)? Do you think my system ran into a thermal issue? Thanks for your help.
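Finding the highest temperatures in a long tegrastats log can be scripted. Below is a rough sketch that assumes every temperature field has the form NAME@valueC, as in the entry above. The function names are made up for illustration.

```python
import re

# Matches fields such as "GPU@39C" or "thermal@37.95C". CPU load fields
# like "100%@2265" are skipped because "%" precedes the "@".
TEMP_FIELD = re.compile(r"(\w+)@(-?\d+(?:\.\d+)?)C\b")

SAMPLE = (
    "RAM 6302/31919MB (lfb 5669x4MB) SWAP 0/15959MB (cached 0MB) "
    "CPU [100%@2265,30%@2265,58%@2265,38%@2265,35%@2265,69%@2265,"
    "35%@2265,37%@2265] EMC_FREQ 0% GR3D_FREQ 2% AO@36.5C GPU@39C "
    "Tdiode@40.5C PMIC@100C AUX@36C CPU@38.5C thermal@37.95C Tboard@36C "
    "GPU 3710/6514 CPU 5100/4614 SOC 3554/4444 CV 0/0 VDDRQ 928/1418 "
    "SYS5V 2929/3074"
)


def parse_temps(line):
    """Return {sensor: degrees C} for one tegrastats line."""
    return {name: float(value) for name, value in TEMP_FIELD.findall(line)}


def max_temps(lines):
    """Return the highest reading seen per sensor across many lines."""
    peaks = {}
    for line in lines:
        for name, value in parse_temps(line).items():
            peaks[name] = max(value, peaks.get(name, value))
    return peaks
```

Calling max_temps on the lines of tegrastatslog.txt (for example via open(...)) would give one peak value per sensor for the whole run.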

It is not a thermal issue per your tegrastats info. Please share the full kernel log; it looks more like a SW problem.

Also, please use an oscilloscope to observe the power supply on the board and check whether the supply is stable enough.

Attached please find the kern.log: kern.log (2.7 MB). The corresponding auto reset happened around 12:16pm to 12:17pm on Aug. 27 (today). After this, it also auto reset at least one more time in the afternoon.

Aug 27 12:15:30 agx kernel: [ 1126.430607] payload 00000000 execute 00100001
Aug 27 12:15:30 agx kernel: [ 1126.430610] 
Aug 27 12:16:24 agx kernel: [ 1179.925109] nvgpu: 17000000.gv11b    gk20a_fifo_handle_pbdma_intr_0:2722 [ERR]  semaphore acquire timeout!
Aug 27 12:16:24 agx kernel: [ 1179.925313] nvgpu: 17000000.gv11b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 24 for ch 509
Aug 27 12:17:26 agx kernel: [    0.000000] Booting Linux on physical CPU 0x0
Aug 27 12:17:26 agx kernel: [    0.000000] Linux version 4.9.140-tegra (buildbrain@mobile-u64-3193) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Wed Apr 8 18:15:20 PDT 2020
Aug 27 12:17:26 agx kernel: [    0.000000] Boot CPU: AArch64 Processor [4e0f0040]
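To pull the same kind of excerpt out of a large kern.log, here is a small sketch that collects the last few lines written before every boot. It assumes each boot starts with the "Booting Linux on physical CPU" line shown above. The function name and the context size are arbitrary.

```python
from collections import deque

BOOT_MARKER = "Booting Linux on physical CPU"


def lines_before_each_boot(lines, context=5):
    """Yield (boot_line, preceding_lines) for every boot marker found."""
    recent = deque(maxlen=context)
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_MARKER in line:
            # Hand back whatever was logged just before this boot.
            yield line, list(recent)
            recent.clear()
        else:
            recent.append(line)


# Hypothetical usage (file name as attached in this thread):
# with open("kern.log", errors="replace") as fh:
#     for boot, before in lines_before_each_boot(fh, context=20):
#         print("\n".join(before), boot, sep="\n", end="\n\n")
```

This only shows what reached the disk. Messages printed in the last moments before a reset may never have been written to kern.log.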

Hi ynjiun,

Please share the full log if possible. These few lines before the reboot are not enough.

Also, we would like to know the details of how to reproduce this issue.

Is this an NVIDIA devkit? What DS application are you running? Is it an NV sample?

Hi Wayne, the full log was already uploaded in the previous post. Please click the file name kern.log (it’s 2.7MB) in the previous post to download it, and let me know if you need more info.

To reproduce the auto reset issue:

I am running deepstream_test_3.py and feeding in 12 video files on AGX Xavier using DeepStream 5.0 with JetPack 4.4.

python3 deepstream_test_3.py file:///data/agx/al.h264 file:///data/agx/ar.h264 file:///data/agx/bl.h264 file:///data/agx/br.h264 file:///data/agx/al.h264 file:///data/agx/ar.h264 file:///data/agx/bl.h264 file:///data/agx/br.h264 file:///data/agx/al.h264 file:///data/agx/ar.h264 file:///data/agx/bl.h264 file:///data/agx/br.h264

On a Bad Day (UPS output voltage around 118V), running the above test easily triggers the auto reset issue.

However, on days when my UPS output voltage indicates >= 119V, auto reset rarely happens.
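The 12-stream command line above is just four files repeated three times, so it can be generated rather than typed. This is a small sketch; the helper name and its defaults are made up, and only the script name and the file URIs come from the post.

```python
import itertools


def build_command(uris, streams=12, script="deepstream_test_3.py"):
    """Return the argv list for one run, cycling through uris to fill `streams` inputs."""
    if not uris:
        raise ValueError("need at least one URI")
    sources = list(itertools.islice(itertools.cycle(uris), streams))
    return ["python3", script, *sources]


URIS = [
    "file:///data/agx/al.h264",
    "file:///data/agx/ar.h264",
    "file:///data/agx/bl.h264",
    "file:///data/agx/br.h264",
]

# Hypothetical usage: print(" ".join(build_command(URIS)))
```

The resulting list can be passed to subprocess.Popen to start more than one pipeline at a time.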

Hi,

It is a little bit complicated.

The current situation and questions are:

  1. We cannot reproduce your issue with our NVIDIA devkit. What is the reproduction rate of this issue? For example, how long do we need to wait to see an error? Since your issue has good days and bad days, we need to know this first.

  2. Are you using the NVIDIA devkit and the power supply for the devkit too?

  3. Have you tried your test in a different environment to make sure there is no voltage swing from your plug? For example, bring the device to another office or bring it home to test it.

Thanks for following up.
To answer your questions:

  1. It’s sporadic. Sometimes crashes happen consecutively, one after another within a few minutes. Sometimes it might take an hour. The more pipelines are running, the easier it is to reproduce. It reproduces at 118V or below input voltage (at 119V and above it is very hard to reproduce). For example, in MAXN mode with fan 255, run 3 instances of deepstream_test_3.py, each fed 12 videos (use videos of different lengths; if all 12 videos are the same length, the problem might not be easy to reproduce). Then it might reproduce quickly. My guess is that the high in-rush current draw of GPU+CPU+SYS+etc. (particularly the GPU) causes a sudden drop in the output voltage of the power supply, although the average total power consumption is within 30W (avg 24W and peak 31W).
  2. I am using the NVIDIA Jetson AGX Xavier devkit and the power supply that came with the devkit.
  3. I just bought a 600W voltage regulator (line conditioner) and plugged the power supply into the regulator. I am going to test whether the crash still happens. So far I have been able to run the above-stated 3 x deepstream_test_3.py x 12 video files (36 video files in total running on 3 different pipelines) without a hiccup yet…
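The average and peak power figures in item 1 can be cross-checked against the rail readings in a tegrastats line. The sketch below assumes each NAME current/average pair is in milliwatts, and treats the sum of the rails as a rough estimate of total draw, not a measurement at the power supply. The function names are made up.

```python
import re

# Matches rail fields such as "GPU 3710/6514". "RAM 6302/31919MB" is
# skipped because the trailing "MB" leaves no word boundary after the digits.
RAIL_FIELD = re.compile(r"\b([A-Z][A-Z0-9]*) (\d+)/(\d+)\b")

SAMPLE_LINE = (
    "RAM 6302/31919MB (lfb 5669x4MB) SWAP 0/15959MB (cached 0MB) "
    "EMC_FREQ 0% GR3D_FREQ 2% AO@36.5C GPU@39C "
    "GPU 3710/6514 CPU 5100/4614 SOC 3554/4444 CV 0/0 VDDRQ 928/1418 "
    "SYS5V 2929/3074"
)


def parse_rails(line):
    """Return {rail: (current_mW, average_mW)} for one tegrastats line."""
    return {name: (int(cur), int(avg))
            for name, cur, avg in RAIL_FIELD.findall(line)}


def total_watts(rails):
    """Return (current_W, average_W) summed over all rails."""
    current = sum(cur for cur, _ in rails.values()) / 1000
    average = sum(avg for _, avg in rails.values()) / 1000
    return current, average
```

Readings taken once per second cannot show millisecond-scale in-rush transients, so an oscilloscope on the supply rail is still needed to confirm or rule out the voltage-drop guess.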

Also want to know …

Is the GPU error always seen before the system shuts down? I notice one of your logs shows the GPU error, but it happened 20 mins before the reboot. It does not look directly connected with the reboot in this case.

You are correct. Sometimes I notice the system suddenly grinds to a halt but does not crash, and eventually recovers by itself. I think that is what you saw 20 mins before the reboot.

A typical auto reset looks like this: all of a sudden the system grinds to a halt, and about 1 minute later it reboots automatically.

Is it always “cannot do kernel paging” in the kernel log, with the stack dump showing gk20a, when this error happens?

Since I cannot reproduce this issue (maybe it is not a bad day today), we may need to collect the error from your side.

Let’s see if it is always the same driver causing the problem. Also, please try to post all errors from the GPU if possible. So far your log is always a truncated one.

Also, are you using syslog or the log from the serial console? It looks like syslog to me.

I can post more logs. Please let me know which log you need under /var/log
kern.log (922.9 KB)
kern.log.2.gz (341.0 KB)
syslog.2.gz (92.7 KB)
I encountered one auto reset this afternoon. You can find it in the attached kern.log. (Funnily, I cannot upload syslog, syslog.1, or kern.log.1 because their suffixes are not allowed…)

Hi,

You could check the serial console log. But remember to remove “quiet” from extlinux.conf. Otherwise the kernel log will be silent.

The serial console log is not under /var/log. It will even dump the log from the bootloader.
https://elinux.org/Jetson/General_debug
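Removing “quiet” can be done by hand in an editor. As a rough sketch of the same change, the function below drops the standalone quiet token from every APPEND line of the file contents. The sample text is only illustrative; the real file on the devkit (usually /boot/extlinux/extlinux.conf) may differ, and it should be backed up before it is overwritten.

```python
def strip_quiet(conf_text):
    """Return conf_text with the 'quiet' token removed from APPEND lines."""
    out = []
    for line in conf_text.splitlines():
        if line.strip().upper().startswith("APPEND"):
            indent = line[: len(line) - len(line.lstrip())]
            # Keep every kernel argument except the bare "quiet" token.
            tokens = [tok for tok in line.split() if tok != "quiet"]
            line = indent + " ".join(tokens)
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative contents, not copied from a real device.
SAMPLE_CONF = (
    "LABEL primary\n"
    "      LINUX /boot/Image\n"
    "      APPEND ${cbootargs} quiet\n"
)
```

The change takes effect on the next boot, after which kernel messages are printed on the serial console.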

It may well show the same result as your current logs. I just want to make sure nothing is missing from syslog.

According to the latest log, I see there is a GPU error in your log again, followed 20 sec later by a kernel panic… But this time there is no kernel paging error. It is a CPU error and comes from the eqos driver (the Ethernet controller).

Will do (but I need to dig out how, or perhaps you could give me a couple of hints on how to record the serial console log).

So far, what’s your take? Is this GPU error caused by software or by voltage fluctuation?

I think that is caused by the GPU driver. We will investigate this.

Hi,

Sorry, I just noticed something in your log.
Are you sure the tegrastats result in #5 is the correct one?

Because the GPU load there is only 2%, it is unlikely to be a case that would cause a GPU error.

Also, your device seems unstable from the beginning. The kernel panic from eqos_napi_poll_rx always seems to be there, but it does not cause the problem 100% of the time.

For example, the one below has an error at 8 pm, but the system reboots hours later.

Aug 30 20:00:51 agx kernel: [ 685.742381] [] napi_gro_receive+0x15c/0x188
Aug 30 20:00:51 agx kernel: [ 685.742397] [] eqos_napi_poll_rx+0x358/0x430
Aug 30 20:00:51 agx kernel: [ 685.742405] [] net_rx_action+0xf4/0x358
Aug 30 20:00:51 agx kernel: [ 685.742413] [] __do_softirq+0x13c/0x3b0
Aug 30 20:00:51 agx kernel: [ 685.742428] [] irq_exit+0xd0/0x118
Aug 30 20:00:51 agx kernel: [ 685.742436] [] __handle_domain_irq+0x6c/0xc0
Aug 30 20:00:51 agx kernel: [ 685.742443] [] gic_handle_irq+0x5c/0xb0
Aug 30 20:00:51 agx kernel: [ 685.742450] [] el1_irq+0xe8/0x194
Aug 30 20:00:51 agx kernel: [ 685.742465] [] smpboot_thread_fn+0xd4/0x248
Aug 30 20:00:51 agx kernel: [ 685.742473] [] kthread+0xec/0xf0
Aug 30 20:00:51 agx kernel: [ 685.742481] [] ret_from_fork+0x10/0x30
Aug 30 20:00:52 agx kernel: [ 686.469476] INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 30 20:00:52 agx kernel: [ 686.469710] 0-…: (1 GPs behind) idle=543/140000000000002/0 softirq=93458/93460 fqs=2154
Aug 30 20:00:52 agx kernel: [ 686.469863] (detected by 1, t=5252 jiffies, g=13903, c=13902, q=6)
Aug 30 20:00:52 agx kernel: [ 686.469999] Task dump for CPU 0:
Aug 30 20:00:52 agx kernel: [ 686.470009] ksoftirqd/0 S 0 3 2 0x00000002
Aug 30 20:00:52 agx kernel: [ 686.470019] Call trace:
Aug 30 20:00:52 agx kernel: [ 686.470050] [] __switch_to+0x9c/0xc0
Aug 30 20:00:52 agx kernel: [ 686.470055] [<000000000000000e>] 0xe
Aug 31 13:18:53 agx kernel: [ 0.000000] Booting Linux on physical CPU 0x0
Aug 31 13:18:53 agx kernel: [ 0.000000] Linux version 4.9.140-tegra (buildbrain@mobile-u64-3193) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Wed Apr 8 18:15:20 PDT 2020
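The gap between the last logged line and the next boot can be measured directly from the timestamps. A minimal sketch follows, assuming the syslog-style "Aug 30 20:00:52" prefix shown above. Syslog omits the year, so 2020 is supplied here as an assumption, and the function names are made up.

```python
from datetime import datetime

BOOT_MARKER = "Booting Linux on physical CPU"


def parse_stamp(line, year=2020):
    """Parse the 15-character 'Aug 30 20:00:52' prefix of a log line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def gaps_before_boot(lines, year=2020):
    """Yield (last_entry_time, boot_time, gap) for each boot marker."""
    previous = None
    for line in lines:
        try:
            stamp = parse_stamp(line, year)
        except ValueError:
            continue  # skip wrapped or damaged lines
        if BOOT_MARKER in line and previous is not None:
            yield previous, stamp, stamp - previous
        previous = stamp


EXCERPT = [
    "Aug 30 20:00:52 agx kernel: [ 686.470055] [<000000000000000e>] 0xe",
    "Aug 31 13:18:53 agx kernel: [ 0.000000] Booting Linux on physical CPU 0x0",
]
```

For the excerpt above the gap is a little over 17 hours. A gap that long tells only that nothing was logged in between; it does not say whether the reboot was manual or automatic.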

Is this reboot triggered by you manually or by the system?
Could you run sudo tegrastats again with the DS application and wait for the error to occur again? I need to know the tegrastats result right before the system reboots.

Please do confirm the tegrastats result.
We ran another 3 hours with the DS sample, but the tegrastats result is totally different from your case.