AGX Xavier power supply: very sensitive to voltage variation

Hi,
Please refer to this post and the kern.log attached to it. The system rebooted automatically right after the nvgpu_set_error occurred at Aug 27 12:16:24. The associated tegrastats log is here. These two logs are from a confirmed case. Please be aware that “the 12 video files have to be of different lengths” to reproduce the problem easily. For example, with four videos of different lengths fed into python3 deepstream_test_3.py, the problem is most likely to happen when the short video runs out while the long video is still running in the same test…

Hi Wayne, while setting up the serial console I captured something very interesting and abnormal: after setting up the serial console for the first time, I manually clicked the power off button on the AGX Xavier. Instead of powering off, the system crashed and rebooted by itself. Lo and behold, I captured the serial console log, attached as console.log (29.8 KB). As you can see, the system went into “Kernel panic” after I clicked “power off”, and it did not power off; instead it rebooted by itself! Could you tell me what’s wrong with my system? Thanks a lot for your help.

P.S. After that I shut down the system. When I tried to start it again, it would not start anymore; it hangs during power-up. See this console log: abnormal_start_crash.log (27.8 KB)

Hi ynjiun,

  1. Do you mean you can run into the kernel panic without running any DS application, just by clicking the power button?

  2. Just a reminder again: the current serial console log is incomplete because of the quiet keyword in /boot/extlinux/extlinux.conf. Please remove it and reboot.
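A minimal sketch of removing the quiet keyword, assuming the kernel arguments sit on an APPEND line of extlinux.conf. The helper name `strip_quiet` is hypothetical, and it only transforms text, so you can check the result before writing anything back to the real file:

```python
def strip_quiet(conf_text: str) -> str:
    """Return conf_text with the standalone `quiet` token dropped from APPEND lines.

    Assumption: kernel arguments live on lines whose first word is APPEND.
    Other lines (including commented-out ones) are returned unchanged.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip()
        if stripped.upper().startswith("APPEND"):
            indent = line[: len(line) - len(stripped)]
            # Splitting on whitespace also collapses repeated spaces on this line.
            tokens = [tok for tok in stripped.split() if tok != "quiet"]
            line = indent + " ".join(tokens)
        out.append(line)
    trailing_newline = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline
```

Back up /boot/extlinux/extlinux.conf before writing the modified text back (that needs root), then reboot so the full boot log shows up on the serial console.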

I think this is device-specific, or maybe caused by voltage variation… Do you have another Xavier devkit to verify with? We ran the same sample on our device overnight and still could not see any error.

Hi Wayne,

Yes, just clicking power off causes the kernel panic, without running any DS application.

After this incident, I tried to change the quiet setting in extlinux.conf and reboot, but the system never rebooted successfully again. I then reflashed the device for a brand new start. Just now, I was able to reproduce the crash with a serial console log capture (quiet removed) and a tegrastats capture by running the same 4 pipelines of deepstream_test_3.py, each fed with 12 videos (48 videos in total, feeding 4 copies of deepstream_test_3.py running at the same time).

The logs are attached: console.log (195.6 KB) tegrastats.log (168 KB)

Ok. Our test only ran x1 process with 12 videos, so the same as your case.

Let us run again today.

Hi,

Could you reproduce the issue with only 30 videos? That is the maximum number we can support on Xavier.
I am also running 30 videos on my side now.

I think I know how to reproduce it more effectively now. Please follow the steps below:

  1. power up in MAXN mode with fan speed 0
  2. start running x4 pipelines, each with 12 videos
  3. after all 4 pipelines are running, try changing the fan speed to any number (175 or 255, etc.). If you are lucky, you might hit the gpu error the first time; if not, try changing the fan speed a few times

Once it hits the gpu error (typically the nvgpu_set_error_notifier_locked:137 error), the system grinds to a halt, and a few minutes later (sometimes as little as 20 secs) it auto reboots…

[  307.750522] nvgpu: 17000000.gv11b    gk20a_fifo_handle_pbdma_intr_0:2722 [ERR]  semaphore acquire timeout!
[  307.750753] nvgpu: 17000000.gv11b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 24 for ch 509
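To pull lines like the two above out of a long kern.log or console.log, here is a rough sketch. It assumes every nvgpu error line follows the same layout as these two: timestamp, device, function name, a number after the colon (which appears to be a source line rather than an error code), then [ERR] and the message. The names `NvgpuError` and `parse_nvgpu_error` are made up for illustration:

```python
import re
from typing import NamedTuple, Optional


class NvgpuError(NamedTuple):
    timestamp: float   # seconds since boot, from the [ ... ] prefix
    device: str        # e.g. 17000000.gv11b
    function: str      # kernel function that logged the error
    line: int          # number printed after the function name
    message: str       # text after the [ERR] tag


# Layout assumed from the two log lines quoted in this thread.
_NVGPU_ERR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:\s+(?P<dev>\S+)\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<msg>.*)"
)


def parse_nvgpu_error(log_line: str) -> Optional[NvgpuError]:
    """Return the parsed fields, or None if the line is not an nvgpu [ERR] line."""
    m = _NVGPU_ERR.search(log_line)
    if m is None:
        return None
    return NvgpuError(
        float(m["ts"]), m["dev"], m["func"], int(m["line"]), m["msg"].strip()
    )
```

Running this over each line of the log and keeping the non-None results gives a quick timeline of GPU errors to line up against the tegrastats capture.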

Afterthought EDIT: in retrospect, step 1 with fan speed 0 might be the problem. By the time the 4 pipelines with 12 videos each are running, more than 30 sec may have passed without the fan running, and the GPU temperature might easily go above 43C or even higher. By the time the fan is turned on at step 3, the GPU might already be overheated… Well, just a guess; I wouldn’t want to reproduce this overheating issue ; ))

Hi,

Please use <30 videos from now on. We don’t support >30 videos.
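As a quick sanity check of that limit against the setup described earlier (4 pipelines with 12 videos each), here is a small sketch. The function names are hypothetical, and "<30" is read strictly here, which is an assumption since 30 was also quoted as the maximum:

```python
MAX_STREAMS = 30  # limit quoted in this thread


def total_streams(pipelines: int, videos_per_pipeline: int) -> int:
    """Total number of video streams across all concurrent pipelines."""
    return pipelines * videos_per_pipeline


def within_limit(pipelines: int, videos_per_pipeline: int,
                 limit: int = MAX_STREAMS) -> bool:
    """True if the total stream count is strictly below the limit."""
    return total_streams(pipelines, videos_per_pipeline) < limit
```

With 4 pipelines of 12 videos the total is 48, so that configuration is over the limit; 2 pipelines of 12 (24 streams) stays under it.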

Also, I think this method is not good. You should not stop the fan when running such a heavily loaded use case…

Did you observe tegrastats when you used this quick method to reproduce the issue?

Hi,

I tried changing the fan speed, but it does not reproduce the issue on our side.

Wayne, thank you for the test.

Since following your suggestion of turning on the fan and keeping the number of videos < 30, and with my 600W line conditioner, I have not been able to reproduce it anymore so far (even when I change the fan speed every sec just to test it).

In summary, this is what I did to clear up my crash (auto reset) issue:

  1. reflash my device
  2. plug the 65W power supply into a 600W line conditioner (my outlet voltage swing might be too big; when it is < 118V, I see my system crash more frequently)
  3. keep the number of videos in my apps pipeline < 30
  4. make sure the fan is running at 255 when the thermal temperature is > 35C (I checked the tegrastats log: when the GPU is < 43C, the system seems much more stable…)
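For anyone who wants to check their own tegrastats.log against the 43C observation in step 4, here is a rough sketch. It assumes each tegrastats line reports the GPU temperature as a field like GPU@33.5C; the function names and the default threshold parameter are hypothetical:

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed field layout: "GPU@<degrees>C", e.g. "GPU@33.5C".
_GPU_TEMP = re.compile(r"\bGPU@(?P<temp>\d+(?:\.\d+)?)C\b")


def gpu_temp_c(line: str) -> Optional[float]:
    """Return the GPU temperature in Celsius from one tegrastats line, or None."""
    m = _GPU_TEMP.search(line)
    return float(m["temp"]) if m else None


def hot_samples(lines: Iterable[str],
                threshold_c: float = 43.0) -> List[Tuple[int, float]]:
    """Return (1-based line number, temperature) for samples at or above the threshold."""
    hits = []
    for idx, line in enumerate(lines, start=1):
        temp = gpu_temp_c(line)
        if temp is not None and temp >= threshold_c:
            hits.append((idx, temp))
    return hits
```

Passing an open log file to `hot_samples` lists the samples where the GPU reached 43C or more, which can then be compared with the timestamps of the gpu errors in the console log.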

Well, I will keep you posted if anything changes (or on any new discovery). Thanks a lot for your help along the way ; ))

Hi Wayne, just for your information: my AGX Xavier is still rebooting by itself. See my latest post here