Long-term impact of Instantaneous Power/OC fault triggers on a Jetson Orin NX 16GB

I have a MIC-711-OX4A1 (Orin NX 16GB, JetPack 5.1.2) that has been showing Overcurrent Protection warnings in the Ubuntu 20.04 GUI. This happens while running a YOLO model fairly consistently at ~95% GPU load. The system is in MAXN mode, temperatures are stable, and for now I’m leaving gdm3 disabled (no GUI). Most CPU cores are very lightly loaded (one around 70%, the rest around 20%), so this issue is GPU driven. Power is fed through a short (~10in.) run of 18AWG cabling from a good PSU (exclusively on its 24V/max 3.7A rail). In the Nvidia Power GUI, plots of power, voltage and current over time show no anomalies, even over hour-long runs. Voltage seems to drop by only a negligible ~0.06V or so (I know this is at a very low sample rate though).

From looking at the event counters, /sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc3_event_cnt is the culprit, which according to the dev guide means the issue is “VDD_IN Instantaneous Power: 30W”. At the highest possible max GPU frequency of 918MHz the trigger rate seems to be tens per second. If I lower the max GPU frequency to 816MHz it still triggers (say once or twice a second), and at 714MHz it almost never triggers. My frame rate drops noticeably when stepping down the frequency (essentially by half)!
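For anyone repeating the frequency steps above, here is a minimal sketch of capping a devfreq device's maximum frequency. It relies on the standard Linux devfreq sysfs files `available_frequencies` and `max_freq`. The function name `set_devfreq_max` is made up for illustration, and the GPU's devfreq directory is an assumption you need to locate on your own image, since it is not given in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set_devfreq_max DEVFREQ_DIR HZ
# Writes HZ to DEVFREQ_DIR/max_freq only if HZ is listed in
# DEVFREQ_DIR/available_frequencies (standard devfreq sysfs files).
set_devfreq_max() {
    local dir=$1 hz=$2 f
    # Word splitting is intended: the file holds space-separated values.
    for f in $(cat "$dir/available_frequencies"); do
        if [ "$f" = "$hz" ]; then
            echo "$hz" > "$dir/max_freq"
            return 0
        fi
    done
    echo "unsupported frequency: $hz" >&2
    return 1
}

# Example (run as root; the directory is an assumption, find yours first):
# set_devfreq_max /sys/class/devfreq/<gpu-device> 816000000
```

Checking against `available_frequencies` first avoids writing a value the driver would reject or silently round.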

for h in /sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc*; do printf '%s .. ' "$h"; cat "$h"; done
/sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc1_event_cnt .. 0
/sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc1_throt_en .. 1
/sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc2_event_cnt .. 0
/sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc2_throt_en .. 1
/sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc3_event_cnt .. 84762
/sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc3_throt_en .. 1
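To put a number on "tens per second", a rough sketch like the one below samples a counter twice and prints the average rate. The function name `oc_rate` is hypothetical, and the interval must be a whole number of seconds because the shell arithmetic is integer only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: oc_rate FILE [SECONDS]
# Reads the counter in FILE, waits SECONDS (default 5, integer),
# reads it again and prints events per second (integer division).
oc_rate() {
    local file=$1 interval=${2:-5} before after
    before=$(cat "$file") || return 1
    sleep "$interval"
    after=$(cat "$file") || return 1
    echo $(( (after - before) / interval ))
}

# Example, using the counter path from the listing above:
# oc_rate /sys/devices/platform/soctherm-oc-event/hwmon/hwmon1/oc3_event_cnt 10
```

Running it once per GPU frequency step gives comparable figures instead of eyeballing the counter.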

My questions are:

  1. Is there potentially something wrong with this board/device? Is there a way to diagnose its health?
  2. Most importantly, if I just leave the max GPU frequency at 918MHz and it keeps triggering the oc3 event (seemingly tens of times every second), is there any risk of degrading the board or causing a failure somewhere?
  3. Since my CPU is so lightly loaded, why is the 30W spike being triggered? Shouldn’t the lower power demand from the CPUs leave more headroom for the GPU spikes?
  4. Is there a better way for me to throttle back (maintaining more performance while reducing the event triggers)?

Thanks

Sorry for the late response.
Is this still an issue that needs support? Are there any results you can share?

Please do not run tests in MAXN mode. It is not a mode you should be testing in.

Running in MAXN will trigger an over-current condition because it exceeds the power budget of the Orin device itself.

Thanks for your response. I am a bit confused though, and hopefully you can clarify. This device is not a devkit, it’s a full-fledged (expensive) board from Advantech. Isn’t the Jetson hardware supposed to support its own software modes/settings without potential long-term board issues or damage? Is long-term damage or degradation a possibility if I keep the max GPU frequency at 918MHz in MAXN with this counter triggering? Do the chances go down if I decrease the frequency by one step so that fewer events trigger?

Sorry if my questions are obtuse, I’m just really trying to understand this a bit better as this topic is not my field.

Hi,

Just to clarify

  1. It does not matter that you are using a board from Advantech. Whether this is a devkit or a custom board from any other vendor, the same rule applies.

  2. We have a document that explains why MAXN should not be used:
    Jetson Orin Nano Series, Jetson Orin NX Series and Jetson AGX Orin Series — NVIDIA Jetson Linux Developer Guide

The simple explanation is that the Orin NX itself has a power budget limit.
The overall power draw of a Jetson depends on many variables, for example CPU freq, GPU freq, EMC freq, etc. When all of these go to max on Orin, the total may exceed the power limit.
Running MAXN is like saying “don’t care whether it exceeds the power limit, just go all out”. But since there is a protection mechanism to protect your board from hardware problems, this will definitely trigger the protection, which is the OC throttling. When that happens, your performance drops because of the throttling.

To prevent the throttling, you have to choose one of the other power modes, which stay under the power limit. Of course your performance may not be as good as in the “go all out” mode, but this prevents OC throttling from happening.
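As a rough sketch of how to see which power modes are available before switching, the helper below lists the mode IDs and names from an nvpmodel config file. It assumes the file is /etc/nvpmodel.conf and that modes are declared on lines of the form `< POWER_MODEL ID=0 NAME=MAXN >`; the function name `list_power_modes` is made up, and the IDs and names differ between modules and JetPack releases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list_power_modes [CONF]
# Prints "ID NAME" for every "< POWER_MODEL ID=n NAME=x >" line in CONF.
# The default path and the line format are assumptions about the image.
list_power_modes() {
    local conf=${1:-/etc/nvpmodel.conf}
    sed -n 's/^<[[:space:]]*POWER_MODEL[[:space:]]\{1,\}ID=\([0-9]\{1,\}\)[[:space:]]\{1,\}NAME=\([^[:space:]>]*\).*/\1 \2/p' "$conf"
}

# Typical follow-up on the device (ID taken from the list printed above):
# sudo nvpmodel -q          # show the currently active mode
# sudo nvpmodel -m <ID>     # switch to the chosen mode
```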

If you want a device with a higher power budget, then Orin NX and AGX Orin might be the options.
