MAXn power mode in production?

Hi there,
Our Jetson AGX Xavier units have been shutting down completely when running deep learning models in the MAXn power mode. I’m assuming it’s a power issue because there are no kernel messages and the highest temperature is ~70 °C. Is this mode actually meant to be used in production, and where can I find that information?
Thanks for the help with this!!

Hi @eden3 ,
I have used the AGX Xavier for DL workloads using the MAXn profile and the jetson-clocks script and left it overnight with no issues. I have a couple of questions:

  • What power brick are you using?

  • Have you tried forcing the fan to run? The default fan curve favors low noise, so you can fix the speed at 30-40% using something like jtop, even when using jetson-clocks.
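For anyone who prefers to pin the fan without jtop, here is a minimal sketch. It assumes the fan PWM target is exposed at /sys/devices/pwm-fan/target_pwm with a 0-255 range, which is where JetPack 4.x releases put it as far as I know; newer releases may use a different path, so treat FAN_PWM_PATH as an assumption and check your board first. The helper names are hypothetical.

```python
# Assumption: JetPack 4.x (L4T 32.x) exposes the fan PWM target here with a
# 0-255 range; other releases may use a hwmon path instead.
FAN_PWM_PATH = "/sys/devices/pwm-fan/target_pwm"


def percent_to_pwm(percent):
    """Map a 0-100 % fan speed to the 0-255 PWM range, clamping bad input."""
    percent = max(0.0, min(100.0, float(percent)))
    return round(percent * 255 / 100)


def set_fan_percent(percent, path=FAN_PWM_PATH):
    """Write the PWM value for the requested percentage and return it.

    Needs root on a real board; pass a different path to try it on a plain file.
    """
    pwm = percent_to_pwm(percent)
    with open(path, "w") as fh:
        fh.write(str(pwm))
    return pwm
```

Calling set_fan_percent(40) as root would request roughly the 40% speed mentioned above. Other tools that manage the fan may overwrite the value later.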

Regards,
Andres
Embedded SW Engineer at RidgeRun
Contact us: support@ridgerun.com
Developers wiki: https://developer.ridgerun.com/
Website: www.ridgerun.com

Are you running a serial console log of dmesg --follow at the moment of shutdown? If you are, then this might say something regular logs don’t have time to save.

@andres.artavia, thanks for the response. We have also had success leaving our system running overnight with certain configurations, but we found one where we routinely see these full power shutdowns, and the only difference is the load on the system.

This is why I’m wondering if MAXn is intended for production.

@linuxdev, thanks for the response! Yes, we have done that and there were no messages leading up to the crash in dmesg.

Right after reboot, check this out:

sudo -s
cd /sys
find . -type f -iname '*reason*' | xargs egrep -a -i '*'

This should name some files, and then the content. Do any of the files show a reason for reset?
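If you want to capture this automatically on every boot, the same query can be sketched in Python. This is only an illustration of the find/egrep one-liner above: the function name is hypothetical, and it simply walks a directory tree (default /sys), reads every regular file whose name contains "reason", and strips the NUL bytes that device tree entries often carry.

```python
import os


def read_reset_reasons(root="/sys"):
    """Return {path: text} for regular files under root named like '*reason*'."""
    reasons = {}
    # os.walk does not follow symlinked directories by default, like 'find .'.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if "reason" not in name.lower():
                continue
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # mirror 'find -type f'
            try:
                with open(path, "rb") as fh:
                    raw = fh.read()
            except OSError:
                continue  # unreadable without root, or the node refuses reads
            text = raw.replace(b"\x00", b"").decode("ascii", errors="replace")
            reasons[path] = text.strip()
    return reasons
```

Run as root, something like `for p, t in sorted(read_reset_reasons().items()): print(f"{p}:{t}")` prints one path:content pair per line, which you can redirect to a file before doing anything else after the reboot.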

It actually happened recently and I haven’t reset it since. The Jetson felt pretty warm afterwards, but nothing indicates that it got too hot.

./kernel/pmc/tegra_reset_reason:TEGRA_POWER_ON_RESET
./firmware/devicetree/base/chosen/reset/pmic-reset-reason:
./firmware/devicetree/base/chosen/reset/pmc-reset-reason:

The “reason” will pertain to the last boot. Since that “reason” was not right after a glitched shutdown it won’t say anything useful. Whenever this happens next, make that query the first thing after you boot it.

Yes, this is the result right after the failure. It fully powers itself off and I need to press the power button to get it back on. It has a jumper cable that usually auto-powers it when it’s plugged in, so needing to press the power button is unusual.

So apparently the system believes the cause is:
tegra_reset_reason:TEGRA_POWER_ON_RESET

Someone from NVIDIA could probably give you more details on what that actually means. For example, perhaps the power supply itself shutting down and restarting could cause that, or perhaps it has to be a particular event and the power supply cannot do that. That’s a starting point.

Thank you! Yeah, I’m hoping someone from NVIDIA can at least tell me if MAXn power mode is meant to be used in production. It would be a shame if we need to use MODE_30W_ALL, but we need a stable product.

I don’t know if TEGRA_POWER_ON_RESET would trigger in the case of a power supply with insufficient power regulation, but keep in mind that Jetsons are quite sensitive to power quality/regulation. Hopefully someone from NVIDIA can comment on whether power supply regulation might result in TEGRA_POWER_ON_RESET.

I am using the power brick that came with the Jetson. 19V, 3.4A, 65W

I have set the fan mode to cool using: nvpmodel -d cool and still saw the issue. I’ll have to look into jtop. Have you seen similar issues due to the temperature?

I have seen various nvpmodels throttling back to prevent consuming more power than the model allows (and this includes MAXN; this has a higher current allowed, but still has limits). I’ve seen throttling back due to temperature. I have not seen shutdown due to temperature except for cases such as not having a heat sink. If you think it is a temperature issue, then one thing you might do is have an ordinary desktop fan cooling the heat sink in addition to the attached fan and see if the problem decreases or is eliminated. Someone from NVIDIA would need to comment on what “unusual” conditions might result in TEGRA_POWER_ON_RESET.

Thank you. I don’t believe it’s a temperature issue, but it was worth ruling out the fan mode regardless. The highest temperature reported is on the GPU, which seems to sit consistently around 70C.
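To back up a statement like this with a record, one option is to log every thermal zone and force each line to disk so the last reading before a hard power-off is not lost. The sketch below assumes only the standard Linux sysfs thermal layout (thermal_zone*/type and thermal_zone*/temp in millidegrees Celsius); the function names and the log path you pass in are hypothetical.

```python
import glob
import os


def read_thermal_zones(root="/sys/class/thermal"):
    """Return {zone type: temperature in degrees C} for each readable zone."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(root, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as fh:
                name = fh.read().strip()
            with open(os.path.join(zone, "temp")) as fh:
                temps[name] = int(fh.read().strip()) / 1000.0  # millidegrees
        except (OSError, ValueError):
            continue  # some zones cannot be read on every board
    return temps


def append_synced(log_path, line):
    """Append one line and fsync it so it survives an abrupt power loss."""
    with open(log_path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())
```

Calling append_synced(path, str(read_thermal_zones())) once a second while the model runs would leave the final temperatures in the file even if the board powers off without warning.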

After more testing, I suspect that the power supplies that come with the Jetson are not sufficient for the load we require. We’ve reproduced the issue on another Jetson and saw the failures increase with a power supply that has a longer harness (same wire gauge), but resolve with a shorter harness. Because of this, we suspect that a voltage drop over the longer harness is causing the issue.
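A rough way to sanity-check the voltage-drop theory is Ohm’s law over the round-trip wire length. The sketch below is an estimate under stated assumptions, not a measurement: the wire gauge is not given in this thread, so the AWG values are placeholders taken from standard copper wire tables, and the current should be whatever your supply actually delivers (3.4 A is the rating of the stock brick).

```python
# Resistance of copper wire in ohms per 1000 ft, from standard AWG tables.
# Assumption: the harness gauge is unknown here; pick the one that matches yours.
OHMS_PER_1000FT = {18: 6.385, 20: 10.15, 22: 16.14}


def harness_drop_volts(length_ft, current_a, awg=18):
    """Steady-state voltage drop over a two-conductor harness (out and back)."""
    round_trip_ohms = 2 * length_ft * OHMS_PER_1000FT[awg] / 1000.0
    return current_a * round_trip_ohms


if __name__ == "__main__":
    for length_ft in (0.25, 1.5, 6.0):
        drop = harness_drop_volts(length_ft, current_a=3.4, awg=18)
        print(f"{length_ft:5.2f} ft: {drop:.3f} V drop at 3.4 A")
```

Keep in mind this only covers DC resistance. Short current spikes when a model starts can pull the input lower than the steady-state figure suggests, because the harness also has inductance, so a small number here does not rule the harness out.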

We also recreated the failure (total shutdown) in the MODE_30W_ALL power mode.

I would still like NVIDIA’s feedback on the MAXn power mode being stable enough to be used in the field.

Incidentally, is the power supply you refer to USB-C, or is it a barrel jack connector? The former (USB-C) is itself limited by the USB specification; the latter (barrel jack) is capable of providing more power. If you are already using the barrel jack, then I wouldn’t expect one of the NVIDIA-supplied power units to be a problem (although perhaps if there is a PCI device and enough other devices? not sure).

Surprisingly, we are using the barrel jack and have reproduced this with multiple bricks/Jetsons.

This sounds like an interesting issue. Are you certain the temperature was ok? 70 C is easily within specs and should not cause a problem. If you happen to have a thermal camera which is calibrated so you can look for something over 95 C, then that would be very interesting if some tiny place had that high of a temperature. If not, then I’d say the ability to reproduce this via barrel jack and supplied PSU means the software responsible for shutdown and throttling at MAXn has problems. I’d think for such a case that this would be something NVIDIA could look into since it can be reproduced and is not from heat.

Also, is the carrier board a dev kit or something else? This has a huge impact.

Yes, a dev kit.
I’ll have to check around for a thermal camera to rule out a hotspot that doesn’t have a temp sensor nearby.
I’m 95% sure it’s not temp related because I ran another test with a variable power supply set to 19V/10A and noticed that changing the length of the power harness seemed directly related to the failure.

Power harness length    Behavior
~3 in                   never failed
~1.5 ft                 brownouts seemingly identical to NVIDIA 19V/65W power brick
~6 ft                   brownouts every time starting ML process

In addition, I recreated the brownout in the MODE_30W_ALL power mode, and with the fan mode set to cool.

Jetsons are extraordinarily sensitive to power regulation quality. If a longer cable makes the problem worse, that points to a power regulation issue. Do you have a longer cable with a significantly thicker wire gauge? Or could you put something like a 2000 uF capacitor and a 100 uF capacitor right where the barrel jack enters? Capacitance right next to the barrel jack can help a lot if that is the issue. Note that a large capacitance is needed, but larger capacitors usually have more inductance, hence the second 100 uF capacitor.
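To get a feel for why the bulk capacitor needs to be large, you can estimate how far an ideal capacitor sags while it alone supplies a short load step: dV = I * dt / C. This is a simplified sketch that ignores ESR, inductance, and whatever the supply still delivers through the cable, and the 5 A / 100 us step is a made-up illustration rather than a measured Jetson transient.

```python
def droop_volts(capacitance_f, step_current_a, duration_s):
    """Voltage sag of an ideal capacitor sourcing a constant current.

    Implements dV = I * dt / C and ignores ESR, ESL, and any current the
    supply still provides during the step.
    """
    return step_current_a * duration_s / capacitance_f


if __name__ == "__main__":
    # Hypothetical load step: 5 A for 100 microseconds.
    for cap_uf in (100, 2000):
        sag = droop_volts(cap_uf * 1e-6, step_current_a=5.0, duration_s=100e-6)
        print(f"{cap_uf:5d} uF: {sag:.2f} V sag")
```

Under those assumptions the 2000 uF part holds the rail far better than 100 uF alone, while the smaller capacitor is there for the faster edges that the large one responds to poorly.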