Orin NX Showing Incorrect Current Consumption Values on VDD_IN

Good morning

We are running an Orin NX from an NVMe SSD with L4T 35.2.1 / JetPack 5.1.

When using jtop, we see the VDD_IN instantaneous power consumption as 156 W and the average as 156 W.

Reading the voltage with cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/in1_input, I see ~4800 mV, which is correct. But cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/curr1_input returns a static value of 32760 that never changes. This is roughly 10x higher than expected and explains why jtop shows 156 W of power consumption.
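
For reference, assuming jtop simply reports power as bus voltage times current from these same hwmon nodes (which lines up with the numbers we see), the stuck current reading reproduces the ~156 W figure:

$ V=$(cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/in1_input)    # bus voltage, ~4800 mV
$ I=$(cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/curr1_input)  # current, stuck at 32760 mA
$ echo $(( V * I / 1000 ))   # mW
157248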

Output of sudo dmesg -wH (the log is spammed with the repeating lines shown below):

[  +0.962186] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005560] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005430] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.006150] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005400] cpufreq: cpu4,cur:246000,set:1984000,set ndiv:155
[  +0.005282] cpufreq: cpu4,cur:257000,set:1984000,set ndiv:155
[  +0.005224] cpufreq: cpu4,cur:246000,set:1984000,set ndiv:155
[  +0.005280] cpufreq: cpu4,cur:246000,set:1984000,set ndiv:155
[  +0.962616] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005860] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005605] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005174] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005366] cpufreq: cpu4,cur:248000,set:1984000,set ndiv:155
[  +0.005226] cpufreq: cpu4,cur:245000,set:1984000,set ndiv:155
[  +0.005241] cpufreq: cpu4,cur:247000,set:1984000,set ndiv:155
[  +0.005422] cpufreq: cpu4,cur:247000,set:1984000,set ndiv:155
[  +0.962737] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005926] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.006146] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005342] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005345] cpufreq: cpu4,cur:256000,set:1984000,set ndiv:155
[  +0.005118] cpufreq: cpu4,cur:246000,set:1984000,set ndiv:155
[  +0.005279] cpufreq: cpu4,cur:246000,set:1984000,set ndiv:155
[  +0.005247] cpufreq: cpu4,cur:247000,set:1984000,set ndiv:155
[  +0.962290] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.006547] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005489] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005726] cpufreq: cpu0,cur:246000,set:1984000,set ndiv:155
[  +0.005747] cpufreq: cpu4,cur:246000,set:1984000,set ndiv:155
[  +0.005312] cpufreq: cpu4,cur:244000,set:1984000,set ndiv:155
[  +0.005230] cpufreq: cpu4,cur:245000,set:1984000,set ndiv:155
[  +0.005184] cpufreq: cpu4,cur:246000,set:1984000,set ndiv:155

All Orin NX modules at our production facility are flashed with the same pinmux and cloned from the same SSD image, but this is the first instance of this issue we have seen out of hundreds of modules flashed.

What can be done to fix this?

Hi,
Please run sudo tegrastats and check the output. The information from tegrastats should be accurate.

Hi @DaneLLL , sudo tegrastats shows the same incorrect output.

If the output of the command cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/curr1_input is incorrect, then tegrastats will definitely be incorrect as well, since its value is derived from the same ina3221 source.

Tegrastats output:
09-05-2024 08:38:57 RAM 8510/14486MB (lfb 7x4MB) SWAP 0/7243MB (cached 0MB) CPU [49%@247,23%@247,20%@247,23%@247,20%@246,21%@247,32%@247,22%@252] EMC_FREQ 0%@3199 GR3D_FREQ 0%@0 GR3D2_FREQ 0%@917 VIC_FREQ 115 APE 174 CV0@37.75C CPU@40.156C SOC2@37.937C SOC0@39.781C CV1@37.937C GPU@37.718C tj@40.937C SOC1@40.937C CV2@37.281C VDD_IN 156199mW/156199mW VDD_CPU_GPU_CV 1269mW/1269mW VDD_SOC 2579mW/2579mW
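
Working backwards from that line (again assuming tegrastats multiplies bus voltage by current) gives an implied bus voltage that matches the ~4800 mV from in1_input, so tegrastats is simply combining a correct bus voltage with the stuck 32760 mA current:

$ echo $(( 156199 * 1000 / 32760 ))   # implied bus voltage in mV
4767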

Hi,
Please upgrade to JetPack 5.1.4 or 6.0GA. Some issues were discovered and fixed in later releases.

@DaneLLL We use these products in a mass-production environment, so upgrading to a newer JetPack version isn’t possible. Are there any troubleshooting steps I can take to solve this?

We have just had another occurrence of this same issue on our production line, in another build of our product.

If I am understanding correctly, curr4 is the summation of curr1-curr3. This is the output of all monitored current inputs on the ina3221 from one of our devices exhibiting the issue (a quick arithmetic check follows the readings):

$ sudo cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/curr4_input
33688
$ sudo cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/curr3_input
576
$ sudo cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/curr2_input
456
$ sudo cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/curr1_input
32760

$ sudo cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/in4_input
163800

$ sudo cat /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/shunt1_resistor 
5000
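
A quick check of that summation (the small mismatch is presumably just the sum channel being sampled at a different instant):

$ echo $(( 32760 + 456 + 576 ))
33792

which is close to the 33688 reported by curr4_input.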

Looks like in4_input is reporting 163.8 V, which should not be physically possible. As I understand it, curr4 and curr1 are derived from the measured shunt voltage together with the shunt resistor value. All shunt resistor values read back as 5000, which is correct. (See Jetson Orin NX power management parameters - #13 by KevinFFF)
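
As a back-of-the-envelope check (assuming the driver derives current as shunt voltage divided by shunt resistance, with shunt1_resistor presumably in micro-ohms, i.e. 5 mOhm), the shunt voltage implied by the stuck curr1 reading is:

$ echo $(( 32760 * 5000 / 1000 ))   # implied shunt voltage in uV
163800

i.e. 163.8 mV across the 5 mOhm shunt. That is also exactly the 163800 figure in in4_input, and (if I am reading the INA3221 datasheet correctly) exactly the part's full-scale shunt-voltage range. So the channel looks pegged at the maximum value the ADC can report.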

The same power supply provides the main 5 V power to two Orin NX modules. I can read the voltage from the other Orin NX and confirm it is around 5 V as intended. The 163.8 V reading from the affected Orin is garbage.

So far, we have two Orin NX modules across two different production builds of our product exhibiting this same issue.

Are you using the devkit or a custom board for the Orin NX?
Please help to clarify whether you would still hit the issue with the latest JP release, or whether the issue is specific to your board.
(You can simply get a devkit to verify.)

I just ran a quick test on my Orin Nano devkit with JP6.0GA (R36.3.0) and got the expected value for curr1_input.

root@tegra-ubuntu:/sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon1# cat curr1_input
936

Hi @KevinFFF, we are using a custom carrier board. We have hundreds of units flashed using initrd massflash, with the rootfs cloned onto NVMe SSDs. Only two modules have shown this issue.

I am trying to understand whether the root cause lies in the device’s hardware or in the software. I suspect you will not be able to replicate this on a devkit, as I haven’t been able to replicate the issue on a devkit either.

Our products are built thousands of miles away in a factory and I won’t have direct access to the hardware of the modules showing this issue for a while.

Are there any other troubleshooting steps I can take remotely to diagnose this problem?
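
For reference, this is the kind of one-shot dump I could ask the factory to run remotely on an affected unit (same paths as above; the hwmon index may differ per boot), to capture every ina3221 reading at once:

$ for f in /sys/bus/i2c/drivers/ina3221/1-0040/hwmon/hwmon4/{in,curr,shunt}*; do printf '%s: ' "$f"; cat "$f"; done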