OC ALARM while doing inference

Hello,

We are currently working on a product based on the Jetson TX2 NX platform.
I recently saw the following messages in the dmesg output:

[ 4333.277276] soctherm: OC ALARM 0x00000011
[ 4334.421116] soctherm: OC ALARM 0x00000011
[ 4335.993122] soctherm: OC ALARM 0x00000001
[ 4337.439201] soctherm: OC ALARM 0x00000011
[ 4338.580276] soctherm: OC ALARM 0x00000001
[ 4339.707316] soctherm: OC ALARM 0x00000001
[ 4340.713568] soctherm: OC ALARM 0x00000010
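
For anyone who wants to tally these alarms rather than read them by eye, here is a minimal Python sketch that extracts the timestamp and hex mask from each line and lists which bit positions are set. It assumes the mask is a plain bit field; what each bit means is not stated anywhere in this thread, so the sketch only reports bit indices and does not name alarm sources. The function name `parse_oc_alarms` is made up for illustration.

```python
import re

# Matches lines such as "[ 4333.277276] soctherm: OC ALARM 0x00000011"
OC_ALARM_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+soctherm: OC ALARM (0x[0-9a-fA-F]+)"
)


def parse_oc_alarms(dmesg_text):
    """Return a list of (timestamp_seconds, mask, set_bit_indices) tuples."""
    events = []
    for line in dmesg_text.splitlines():
        match = OC_ALARM_RE.search(line)
        if match is None:
            continue  # not an OC ALARM line
        timestamp = float(match.group(1))
        mask = int(match.group(2), 16)
        # Assumption: the value is a bit mask; list the positions that are set.
        bits = [i for i in range(32) if mask & (1 << i)]
        events.append((timestamp, mask, bits))
    return events


# Three of the lines from the log above, used as sample input.
SAMPLE = """\
[ 4333.277276] soctherm: OC ALARM 0x00000011
[ 4335.993122] soctherm: OC ALARM 0x00000001
[ 4340.713568] soctherm: OC ALARM 0x00000010
"""

if __name__ == "__main__":
    for timestamp, mask, bits in parse_oc_alarms(SAMPLE):
        print(f"{timestamp:.6f}s mask=0x{mask:08x} bits={bits}")
```

On a device, the output of `dmesg` could be passed to `parse_oc_alarms` in place of `SAMPLE`. In the log above only bits 0 and 4 ever appear, alone or together.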

This happens on all our TX2 NX modules, and only when doing inference. Our Jetsons are in MAXN (0) mode.

After some searching in the kernel source code and on this forum, this seems to be related to a thermal or electrical issue.
Our engineers have checked, but they haven’t seen anything exceeding the Jetson specifications for power supply or thermal configuration.

We would like to understand these error codes and to be sure that these alarms won’t affect the performance of our algorithm by throttling the CPU or GPU speed.

Thanks for your help.

hello Pinout21,

please also see Topic 188504.
for testing purposes, you may revise current-critical-limit-ma to avoid these warning alarms.

Hello JerryChang,

We have already tried that in our build, but the warning is still there.

Of course, we have tried the official carrier board with different power supplies rated above the recommended spec (a 5 V/10 A lab power supply, for example), but this message still appears.
For information, we have checked our 5 V rail with an oscilloscope and even tried a higher voltage (5.2 V, for example).

The main concern is: does this affect performance?

this happens because it is reaching the hardware spec, and it is trying to protect the hardware.
you should also check whether the CPU/GPU frequency drops after an OC event happens.
anyway, the actual solution is to use the powerestimator to create a custom power mode.

How can we check the CPU/GPU frequency without using tegrastats?
Unfortunately, the powerestimator doesn’t work with the Jetson TX2.

you may use the commands below to monitor the CPU/GPU frequency.
CPU freq: $ watch -n 0.1 cat /sys/devices/system/cpu/cpufreq/policy0/cpuinfo_cur_freq
GPU freq: $ watch -n 0.1 cat /sys/kernel/debug/bpmp/debug/clk/nafll_gpu/pto_counter
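
If watching the values by eye is not practical during an inference run, the same two files can be polled from a script. The Python sketch below is one possible approach, not an NVIDIA tool: it samples both paths quoted above and flags any sample that falls below 90% of the highest value seen so far. The 90% threshold, the function names, and the assumption that each file contains a single integer are all illustrative. Reading the debugfs path normally requires root.

```python
import time

# Paths quoted in the reply above; debugfs normally requires root access.
CPU_FREQ_PATH = "/sys/devices/system/cpu/cpufreq/policy0/cpuinfo_cur_freq"
GPU_FREQ_PATH = "/sys/kernel/debug/bpmp/debug/clk/nafll_gpu/pto_counter"


def read_int(path):
    """Read a single integer from a file, or return None if that fails."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def find_drops(samples, threshold=0.9):
    """Return indices of samples below threshold * the peak seen so far."""
    drops = []
    peak = None
    for index, value in enumerate(samples):
        if value is None:
            continue  # unreadable sample, skip it
        if peak is None or value > peak:
            peak = value
        elif value < threshold * peak:
            drops.append(index)
    return drops


def monitor(duration_s=10.0, interval_s=0.1,
            cpu_path=CPU_FREQ_PATH, gpu_path=GPU_FREQ_PATH):
    """Sample both files every interval_s seconds for duration_s seconds."""
    cpu, gpu = [], []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        cpu.append(read_int(cpu_path))
        gpu.append(read_int(gpu_path))
        time.sleep(interval_s)
    return cpu, gpu
```

A typical use would be to call `cpu, gpu = monitor(duration_s=60)` while inference is running, then compare `find_drops(cpu)` and `find_drops(gpu)` against the OC ALARM timestamps in dmesg to see whether a frequency drop follows an alarm.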

Thanks.
Here is what I understand.
MAXN mode is a kind of overclocking and should not be used in production. The caveat is that the other modes lose performance.
