System throttling due to over-current on JP4.6.2

We are using a Jetson Xavier NX with JetPack 4.6.2 on a Photon carrier board from Connect Tech to analyse traffic data from a camera (e-CAM21_CUNX, Sony STARVIS™ IMX327, from e-con Systems) connected to the board. Our current detector is YOLOv4 (an INT8-optimized engine), used via the nvinfer plugin as part of a GStreamer pipeline. During our latest recordings with this setup we found many error messages in kern.log:

Jan 31 10:22:21 nvidia-desktop kernel: [ 1610.122320] soctherm: OC ALARM 0x00000001
Jan 31 10:22:55 nvidia-desktop kernel: [ 1644.545385] soctherm: OC ALARM 0x00000001
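For context, the pipeline has roughly this shape (a simplified sketch; resolution, element properties and the nvinfer config path are placeholders rather than our exact command line):

# Simplified shape of our capture + inference pipeline (placeholder values)
gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! \
  nvinfer config-file-path=config_infer_yolov4.txt ! \
  nvvideoconvert ! nvdsosd ! fakesink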

We have already read in the forum that this is an over-current problem, probably caused by YOLOv4 because it puts a heavy load on the GPU. The strange thing is that this sometimes causes nvargus-daemon to time out (but not always).

[ERROR] [31.01.2023 10:23:54] NvArgusCameraSrc: TIMEOUT (6): Argus Error Status

Sometimes there are a great many error messages like these (more than a thousand for a recording that runs for an hour), but nvargus-daemon does not crash with a timeout error:

[ 2776.298962] soctherm: OC ALARM 0x00000001
[ 2887.703029] soctherm: OC ALARM 0x00000001
[ 2963.969347] soctherm: OC ALARM 0x00000001
[ 3010.580975] soctherm: OC ALARM 0x00000001
[ 3480.486639] soctherm: OC ALARM 0x00000005
[ 3582.778696] soctherm: OC ALARM 0x00000001
[ 3719.112141] soctherm: OC ALARM 0x00000001
[ 3762.852756] soctherm: OC ALARM 0x00000001
[ 3787.786256] soctherm: OC ALARM 0x00000001
[ 3893.516713] soctherm: OC ALARM 0x00000001
[ 3951.728683] soctherm: OC ALARM 0x00000001
[ 4007.985995] soctherm: OC ALARM 0x00000001
[ 4103.323676] soctherm: OC ALARM 0x00000001
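For what it is worth, we count the alarms per recording simply with grep (assuming kern.log covers the recording window; the log path may differ on other setups):

# Count how many OC alarms were logged so far
grep -c "soctherm: OC ALARM" /var/log/kern.log

# Watch for new alarms live while the pipeline is running
sudo dmesg -w | grep --line-buffered "soctherm: OC ALARM"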

It is possible that the nvargus-daemon timeout is not caused by the OC alarm at all.
e-con Systems suggested executing the script max-isp-vi-clks.sh before recording; it sets:

Maximum ISP clock set : 576000000
Maximum VIC clock set : 601600000
Maximum EMC clock set : 1866000000
Maximum VI clock set : 460800000

This is supposed to prevent the nvargus timeout error (our rough guess at what the script does is sketched below).
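We have not inspected the script in detail, but we assume it pins the camera-path clocks via the standard Jetson debugfs clock interface, roughly like this (an assumption on our part, not the actual e-con script):

#!/bin/bash
# Our rough guess at what max-isp-vi-clks.sh does: pin the camera-path clocks
# (VI, ISP, NVCSI, EMC) to their maximum rates through the BPMP debugfs nodes.
# The real e-con script may differ; the VIC clock is raised through its own
# devfreq node, which is omitted here. Run as root.
for clk in vi isp nvcsi emc; do
  echo 1 > "/sys/kernel/debug/bpmp/debug/clk/$clk/mrq_rate_locked"
  cat "/sys/kernel/debug/bpmp/debug/clk/$clk/max_rate" \
      > "/sys/kernel/debug/bpmp/debug/clk/$clk/rate"
done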
Our question is: do we need to be concerned about these OC ALARM errors, or are they just an alert that we are getting close to the limits of the system?
These errors are probably different levels of hardware throttling; is there an explanation of the different OC ALARM codes anywhere?

Hi,

Here is the explanation.

Hi @WayneWWW,
we have played around a bit with the power estimator tool and also with
/sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/crit_current_limit_0
/sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/warn_current_limit_0

We never set crit_current_limit_0 higher than 5000. In our tests we always decreased this value so that OC ALARM messages would appear even under a lighter workload (see the sketch below).
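For reference, this is how we read and lower the limits (our own commands based on the sysfs nodes above; values are in mA and root is required):

# Read the current warning/critical input-current limits (mA) of channel 0
cat /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/warn_current_limit_0
cat /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/crit_current_limit_0

# Lower them to provoke OC ALARMs earlier (example values; not persistent across reboots)
echo 3000 | sudo tee /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/warn_current_limit_0
echo 3000 | sudo tee /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/crit_current_limit_0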

We realized that when the value in warn_current_limit_0 is reached, the following message appears:
Feb 8 07:08:16 nvidia-desktop kernel: [45863.993379] soctherm: OC ALARM 0x00000004

If crit_current_limit_0 is also set to 3000 mA, the OC ALARM appears with different codes:
soctherm: OC ALARM 0x00000001
soctherm: OC ALARM 0x00000005
soctherm: OC ALARM 0x00000000

Is it documented anywhere what the difference between the various OC ALARM codes is?
Does the system always go into throttling mode when the values from warn_current_limit_0 and crit_current_limit_0 are reached? Or is the first level just a warning when warn_current_limit_0 is reached, and the system only starts throttling once it is heavily loaded and reaches crit_current_limit_0?

We are checking the VDD_IN value in the tegrastats logs and it seems to stay in the range 8000-10000 mW, yet the OC ALARM message appears when crit_current_limit_0 is set to 3000 mA. Did we miscalculate something? We would expect the board to go into throttling mode only once it reaches 15 W.
The file /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/in_voltage0_input contains the value 4984.

RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [20%@1190,23%@1190,38%@1190,24%@1190,18%@1190,22%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 98%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.75C VDD_IN 8532/8532 VDD_CPU_GPU_CV 3668/3668 VDD_SOC 2428/2428
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [24%@1190,20%@1190,33%@1190,25%@1190,20%@1190,22%@1190] EMC_FREQ 29%@1866 GR3D_FREQ 98%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@28.5C thermal@28.75C VDD_IN 8452/8492 VDD_CPU_GPU_CV 3588/3628 VDD_SOC 2468/2448
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [30%@1190,19%@1190,45%@1190,27%@1190,21%@1190,22%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 4%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.75C VDD_IN 8692/8558 VDD_CPU_GPU_CV 3741/3665 VDD_SOC 2468/2454
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [22%@1190,20%@1190,40%@1190,25%@1190,24%@1190,16%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 1%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.9C VDD_IN 8771/8611 VDD_CPU_GPU_CV 3861/3714 VDD_SOC 2468/2458
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [27%@1190,21%@1190,34%@1190,31%@1190,18%@1190,20%@1190] EMC_FREQ 29%@1866 GR3D_FREQ 2%@752 NVENC 499 NVENC1 499 VIC_FREQ 4%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@28.5C thermal@28.9C VDD_IN 8692/8627 VDD_CPU_GPU_CV 3781/3727 VDD_SOC 2468/2460
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [24%@1190,20%@1190,38%@1190,29%@1190,17%@1190,20%@1190] EMC_FREQ 29%@1866 GR3D_FREQ 25%@752 NVENC 499 NVENC1 499 VIC_FREQ 25%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.9C VDD_IN 8572/8618 VDD_CPU_GPU_CV 3668/3717 VDD_SOC 2468/2461
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [30%@1190,23%@1190,36%@1190,25%@1190,19%@1190,22%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 39%@752 NVENC 499 NVENC1 499 VIC_FREQ 21%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@28.9C VDD_IN 8532/8606 VDD_CPU_GPU_CV 3662/3709 VDD_SOC 2468/2462
RAM 4340/7773MB (lfb 260x4MB) SWAP 0/3886MB (cached 0MB) CPU [30%@1190,25%@1190,36%@1190,26%@1190,19%@1190,19%@1190] EMC_FREQ 29%@1866 GR3D_FREQ 58%@752 NVENC 499 NVENC1 499 VIC_FREQ 0%@601 APE 150 MTS fg 0% bg 5% AO@29.5C GPU@30.5C iwlwifi@32C PMIC@50C AUX@28C CPU@28.5C thermal@28.9C VDD_IN 8412/8581 VDD_CPU_GPU_CV 3588/3694 VDD_SOC 2428/2458
RAM 4350/7773MB (lfb 259x4MB) SWAP 0/3886MB (cached 0MB) CPU [35%@1190,26%@1190,39%@1190,34%@1190,67%@1190,26%@1190] EMC_FREQ 30%@1866 GR3D_FREQ 98%@752 NVENC 499 NVENC1 499 VIC_FREQ 6%@601 APE 150 MTS fg 0% bg 6% AO@29.5C GPU@30C iwlwifi@32C PMIC@50C AUX@28C CPU@29C thermal@29.05C VDD_IN 8891/8616 VDD_CPU_GPU_CV 4020/3730 VDD_SOC 2468/2459
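For reference, our own back-of-the-envelope arithmetic (not from any documentation) that makes us think the average draw stays well below the limit:

# Compare the critical limit (mA) and the observed VDD_IN (mW from tegrastats)
# in the same units, using the rail voltage from in_voltage0_input (mV, ~4984 here).
V_MV=$(cat /sys/devices/c250000.i2c/i2c-7/7-0040/iio:device0/in_voltage0_input)
awk -v v="$V_MV" 'BEGIN {
  printf "crit limit 3000 mA -> %.1f W\n", v * 3000 / 1e6   # ~15 W
  printf "VDD_IN 9000 mW     -> %.2f A\n", 9000 / v          # ~1.8 A, below 3 A
}'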

So we have defined a custom power mode with reduced CPU and GPU frequencies, but we would like to understand the VDD_IN value better and how to prevent OC ALARMs.
With the predefined 15W 6-core mode, the warning message should appear if the device draws more than 15 W and the critical one when it draws 25 W, correct?
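For completeness, this is roughly how we activate and verify the custom mode (mode ID 8 is just the example slot we added to /etc/nvpmodel.conf, not a stock mode):

# Select our custom power mode and confirm the resulting caps
sudo nvpmodel -m 8
nvpmodel -q                  # prints the active power mode
sudo jetson_clocks --show    # shows current CPU/GPU/EMC frequency settings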
