System throttling due to over-current on JP4.6.2

We are using a Jetson Xavier NX with JetPack 4.6.2 on a Photon carrier board from Connect Tech to analyse traffic data from a camera (e-CAM21_CUNX, Sony STARVIS™ IMX327, from e-con Systems) connected to the board. Our current detector is YOLOv4 (an INT8-optimized engine), used via the nvinfer plugin as part of a GStreamer pipeline. During our latest recordings with this setup we found many error messages in kern.log:

Jan 31 10:22:21 nvidia-desktop kernel: [ 1610.122320] soctherm: OC ALARM 0x00000001
Jan 31 10:22:55 nvidia-desktop kernel: [ 1644.545385] soctherm: OC ALARM 0x00000001
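For context, the pipeline has roughly this shape (a simplified sketch; resolution, element properties and the nvinfer config path are placeholders rather than our exact command line):

# Simplified shape of our capture + inference pipeline (placeholder values)
gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! \
  nvinfer config-file-path=config_infer_yolov4.txt ! \
  nvvideoconvert ! nvdsosd ! fakesink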

We have already read in the forum that this is an over-current problem, probably caused by YOLOv4 because it puts a heavy load on the GPU. The strange thing is that this sometimes causes nvargus-daemon to time out (but not always).

[ERROR] [31.01.2023 10:23:54] NvArgusCameraSrc: TIMEOUT (6): Argus Error Status

Sometimes there are a great many error messages like these (more than a thousand for a recording that runs for an hour), but nvargus-daemon does not crash with a timeout error:

[ 2776.298962] soctherm: OC ALARM 0x00000001
[ 2887.703029] soctherm: OC ALARM 0x00000001
[ 2963.969347] soctherm: OC ALARM 0x00000001
[ 3010.580975] soctherm: OC ALARM 0x00000001
[ 3480.486639] soctherm: OC ALARM 0x00000005
[ 3582.778696] soctherm: OC ALARM 0x00000001
[ 3719.112141] soctherm: OC ALARM 0x00000001
[ 3762.852756] soctherm: OC ALARM 0x00000001
[ 3787.786256] soctherm: OC ALARM 0x00000001
[ 3893.516713] soctherm: OC ALARM 0x00000001
[ 3951.728683] soctherm: OC ALARM 0x00000001
[ 4007.985995] soctherm: OC ALARM 0x00000001
[ 4103.323676] soctherm: OC ALARM 0x00000001
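For what it is worth, we count the alarms per recording simply with grep (assuming kern.log covers the recording window; the log path may differ on other setups):

# Count how many OC alarms were logged so far
grep -c "soctherm: OC ALARM" /var/log/kern.log

# Watch for new alarms live while the pipeline is running
sudo dmesg -w | grep --line-buffered "soctherm: OC ALARM"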

It is possible that the nvargus-daemon timeout is not caused by the OC alarm at all.
e-con Systems suggested executing the script max-isp-vi-clks.sh before recording; it sets:

Maximum ISP clock set : 576000000
Maximum VIC clock set : 601600000
Maximum EMC clock set : 1866000000
Maximum VI clock set : 460800000

This is supposed to prevent the nvargus timeout error (our rough guess at what the script does is sketched below).
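We have not inspected the script in detail, but we assume it pins the camera-path clocks via the standard Jetson debugfs clock interface, roughly like this (an assumption on our part, not the actual e-con script):

#!/bin/bash
# Our rough guess at what max-isp-vi-clks.sh does: pin the camera-path clocks
# (VI, ISP, NVCSI, EMC) to their maximum rates through the BPMP debugfs nodes.
# The real e-con script may differ; the VIC clock is raised through its own
# devfreq node, which is omitted here. Run as root.
for clk in vi isp nvcsi emc; do
  echo 1 > "/sys/kernel/debug/bpmp/debug/clk/$clk/mrq_rate_locked"
  cat "/sys/kernel/debug/bpmp/debug/clk/$clk/max_rate" \
      > "/sys/kernel/debug/bpmp/debug/clk/$clk/rate"
done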
Our question is: do we need to be concerned about these OC ALARM errors, or are they just an alert that we are getting close to the limits of the system?
These errors are probably different levels of hardware throttling; is there an explanation of the different OC ALARM codes anywhere?

Hi,

Here is the explanation.

Hi @WayneWWW,
we have played around a bit with the power estimator tool and also with
/sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/crit_current_limit_0
/sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/warn_current_limit_0

We never set crit_current_limit_0 higher than 5000. In our tests we always decreased this value so that OC ALARM messages would appear even under a lighter workload (see the sketch below).
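For reference, this is how we read and lower the limits (our own commands based on the sysfs nodes above; values are in mA and root is required):

# Read the current warning/critical input-current limits (mA) of channel 0
cat /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/warn_current_limit_0
cat /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/crit_current_limit_0

# Lower them to provoke OC ALARMs earlier (example values; not persistent across reboots)
echo 3000 | sudo tee /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/warn_current_limit_0
echo 3000 | sudo tee /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/crit_current_limit_0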

We realized that when the value in warn_current_limit_0 is reached, the following message appears:
Feb 8 07:08:16 nvidia-desktop kernel: [45863.993379] soctherm: OC ALARM 0x00000004

If crit_current_limit_0 is also set to 3000 mA, the OC ALARM appears with different codes:
soctherm: OC ALARM 0x00000001
soctherm: OC ALARM 0x00000005
soctherm: OC ALARM 0x00000000

Is it documented anywhere what the difference between the various OC ALARM codes is?
Does the system always go into throttling mode when the values from warn_current_limit_0 and crit_current_limit_0 are reached? Or is the first level just a warning when warn_current_limit_0 is reached, and the system only starts throttling once it is heavily loaded and reaches crit_current_limit_0?

We are checking the VDD_IN value in the tegrastats logs and it seems to stay in the range 8000-10000 mW, yet the OC ALARM message appears when crit_current_limit_0 is set to 3000 mA. Did we miscalculate something? We would expect the board to go into throttling mode only once it reaches 15 W.
The file /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/in_voltage0_input contains the value 4984.

RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [20%@1190,23%@1190,38%@1190,24%@1190,18%@1190,22%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 98%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.75C VDD_IN 8532/8532 VDD_CPU_GPU_CV 3668/3668 VDD_SOC 2428/2428
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [24%@1190,20%@1190,33%@1190,25%@1190,20%@1190,22%@1190] EMC_FREQ 29%@1866 GR3D_FREQ 98%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@28.5C thermal@28.75C VDD_IN 8452/8492 VDD_CPU_GPU_CV 3588/3628 VDD_SOC 2468/2448
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [30%@1190,19%@1190,45%@1190,27%@1190,21%@1190,22%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 4%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.75C VDD_IN 8692/8558 VDD_CPU_GPU_CV 3741/3665 VDD_SOC 2468/2454
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [22%@1190,20%@1190,40%@1190,25%@1190,24%@1190,16%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 1%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.9C VDD_IN 8771/8611 VDD_CPU_GPU_CV 3861/3714 VDD_SOC 2468/2458
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [27%@1190,21%@1190,34%@1190,31%@1190,18%@1190,20%@1190] EMC_FREQ 29%@1866 GR3D_FREQ 2%@752 NVENC 499 NVENC1 499 VIC_FREQ 4%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@28.5C thermal@28.9C VDD_IN 8692/8627 VDD_CPU_GPU_CV 3781/3727 VDD_SOC 2468/2460
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [24%@1190,20%@1190,38%@1190,29%@1190,17%@1190,20%@1190] EMC_FREQ 29%@1866 GR3D_FREQ 25%@752 NVENC 499 NVENC1 499 VIC_FREQ 25%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.9C VDD_IN 8572/8618 VDD_CPU_GPU_CV 3668/3717 VDD_SOC 2468/2461
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [30%@1190,23%@1190,36%@1190,25%@1190,19%@1190,22%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 39%@752 NVENC 499 NVENC1 499 VIC_FREQ 21%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.9C VDD_IN 8532/8606 VDD_CPU_GPU_CV 3662/3709 VDD_SOC 2468/2462
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [30%@1190,25%@1190,36%@1190,26%@1190,19%@1190,19%@1190] EMC_FREQ 29%@1866 GR3D_FREQ 58%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30.5C iwlwifi@32C PMIC@50C AUX@28C CPU@28.5C thermal@28.9C VDD_IN 8412/8581 VDD_CPU_GPU_CV 3588/3694 VDD_SOC 2428/2458
RAM 4350/7773MB (lfb 259x4MB) SWAP 0/3886MB (cached 0MB) CPU [35%@1190,26%@1190,39%@1190,34%@1190,67%@1190,26%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 98%@752 NVENC 499 NVENC1 499 VIC_FREQ 6%@601 APE 150 MTS fg 0% bg 6% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@29.05C VDD_IN 8891/8616 VDD_CPU_GPU_CV 4020/3730 VDD_SOC 2468/2459
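For reference, our own back-of-the-envelope arithmetic (not from any documentation) that makes us think the average draw stays well below the limit:

# Compare the critical limit (mA) and the observed VDD_IN (mW from tegrastats)
# in the same units, using the rail voltage from in_voltage0_input (mV, ~4984 here).
V_MV=$(cat /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/in_voltage0_input)
awk -v v="$V_MV" 'BEGIN {
  printf "crit limit 3000 mA -> %.1f W\n", v * 3000 / 1e6   # ~15 W
  printf "VDD_IN 9000 mW     -> %.2f A\n", 9000 / v          # ~1.8 A, below 3 A
}'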

So we have defined a custom power mode with reduced CPU and GPU frequencies, but we would like to understand the VDD_IN value better and how to prevent OC ALARMs.
With the predefined 15W 6-core mode, the warning message should appear if the device draws more than 15 W and the critical one when it draws 25 W, correct?
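For completeness, this is roughly how we activate and verify the custom mode (mode ID 8 is just the example slot we added to /etc/nvpmodel.conf, not a stock mode):

# Select our custom power mode and confirm the resulting caps
sudo nvpmodel -m 8
nvpmodel -q                  # prints the active power mode
sudo jetson_clocks --show    # shows current CPU/GPU/EMC frequency settings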
