CPU frequency very low since the system started

Hello, we have found a very rare issue with our product. After the system starts, the CPU frequency stays consistently low and the voltage is also abnormal. Please take a look. The logs are attached. The base BSP is r35.3.1. Thank you.
tegrastats.txt (139.0 KB)
dmesg.txt (4.2 MB)

What is the result of

grep "" /sys/class/hwmon/hwmon*/oc*

I checked, and there are no files with “oc” in their names under the /sys/class/hwmon/hwmon0~3 directories.

/sys/class/hwmon/hwmon0# ls
device  name  power  subsystem  temp1_input  uevent
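
If you want to log these values repeatedly instead of running grep by hand, something along these lines might work. It is only a sketch: the `oc*` pattern is taken from the command above, while the function name `read_oc_files` and its `root` parameter are made up for illustration. An empty result means no over-current event files are exposed at all, which matches the listing above.

```python
import glob
import os


def read_oc_files(root="/sys/class/hwmon"):
    """Return {path: contents} for every hwmon*/oc* file under root.

    Roughly mirrors `grep "" /sys/class/hwmon/hwmon*/oc*`. The function
    name and the `root` parameter are illustrative, not an NVIDIA tool.
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "oc*"))):
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                results[path] = f.read().strip()
        except OSError as exc:
            # Some sysfs attributes may be unreadable without root.
            results[path] = "<unreadable: {}>".format(exc)
    return results


if __name__ == "__main__":
    found = read_oc_files()
    if not found:
        print("no oc* files found under /sys/class/hwmon")
    for path, value in found.items():
        print("{}:{}".format(path, value))
```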

What is the result of

ls -al /sys/class/hwmon/

Like this

~# ls -al /sys/class/hwmon/
total 0
drwxr-xr-x  2 root root 0 Mar 27  2023 .
drwxr-xr-x 84 root root 0 Mar 27  2023 ..
lrwxrwxrwx  1 root root 0 Mar 27  2023 hwmon0 -> ../../devices/virtual/thermal/thermal_zone5/hwmon0
lrwxrwxrwx  1 root root 0 Mar 27  2023 hwmon1 -> ../../devices/platform/39c0000.tachometer/hwmon/hwmon1
lrwxrwxrwx  1 root root 0 Mar 27  2023 hwmon2 -> ../../devices/platform/pwm-fan/hwmon/hwmon2
lrwxrwxrwx  1 root root 0 Mar 27  2023 hwmon3 -> ../../devices/platform/c250000.i2c/i2c-7/7-0040/hwmon/hwmon3

What is the CPU frequency when the system is idle and you run sudo jetson_clocks?

Because this issue is relatively difficult to reproduce, I ran the command on the module that previously showed the problem. The results are below (the NX CPU frequency is currently normal):

$ sudo jetson_clocks --show
SOC family:tegra194  Machine:NVIDIA Jetson Xavier NX Developer Kit
Online CPUs: 0-5
cpu0: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu4: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu5: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
GPU MinFreq=1109250000 MaxFreq=1109250000 CurrentFreq=1109250000
EMC MinFreq=204000000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=1
DLA0_CORE:   Online=1 MinFreq=0 MaxFreq=1100800000 CurrentFreq=1100800000
DLA0_FALCON: Online=1 MinFreq=0 MaxFreq=640000000 CurrentFreq=640000000
DLA1_CORE:   Online=1 MinFreq=0 MaxFreq=1100800000 CurrentFreq=1100800000
DLA1_FALCON: Online=1 MinFreq=0 MaxFreq=640000000 CurrentFreq=640000000
PVA0_VPS0: Online=1 MinFreq=0 MaxFreq=819200000 CurrentFreq=819200000
PVA0_VPS1: Online=1 MinFreq=0 MaxFreq=819200000 CurrentFreq=819200000
PVA0_AXI:  Online=1 MinFreq=0 MaxFreq=601600000 CurrentFreq=601600000
PVA1_VPS0: Online=1 MinFreq=0 MaxFreq=819200000 CurrentFreq=819200000
PVA1_VPS1: Online=1 MinFreq=0 MaxFreq=819200000 CurrentFreq=819200000
PVA1_AXI:  Online=1 MinFreq=0 MaxFreq=601600000 CurrentFreq=601600000
CVNAS MinFreq=0 MaxFreq=576000000 CurrentFreq=576000000
FAN Dynamic Speed control=active hwmon2_pwm1=0
NV Power Mode: MODE_20W_6CORE
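
To compare many captures like the one above, the `cpuN:` lines can be parsed automatically. The sketch below assumes the `key=value` layout shown in this output. The helper names `parse_cpu_lines` and `below_max` are hypothetical. With jetson_clocks applied, MinFreq equals MaxFreq, so a CurrentFreq below MaxFreq would be worth a closer look.

```python
import re

# Matches key=value pairs such as "MaxFreq=1420800" or "Governor=schedutil".
_PAIR = re.compile(r"(\w+)=(\S+)")


def parse_cpu_lines(text):
    """Parse the 'cpuN: ...' lines of `jetson_clocks --show` output.

    Returns {cpu_name: {key: value}}, with frequency fields as integers.
    """
    cpus = {}
    for line in text.splitlines():
        m = re.match(r"\s*(cpu\d+):\s*(.*)", line)
        if not m:
            continue
        fields = dict(_PAIR.findall(m.group(2)))
        for key in ("MinFreq", "MaxFreq", "CurrentFreq"):
            if key in fields:
                fields[key] = int(fields[key])
        cpus[m.group(1)] = fields
    return cpus


def below_max(cpus):
    """Return the names of CPUs whose CurrentFreq is lower than MaxFreq."""
    return sorted(
        name for name, f in cpus.items()
        if f.get("CurrentFreq", 0) < f.get("MaxFreq", 0)
    )
```

Running `below_max(parse_cpu_lines(output))` on the healthy capture above should return an empty list, since every core reports CurrentFreq equal to MaxFreq.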

Was your previous case running in MAXN mode? It sounds like throttling occurred.

Yes, I didn’t change the power mode. I want to know if this is a quality issue, a software issue, or a hardware power supply issue, and how to troubleshoot it.
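
Since the problem is rare, one way to gather evidence for the next occurrence could be to record the standard Linux cpufreq sysfs values at boot. This is a rough sketch and not an NVIDIA tool: the file names are the generic cpufreq attributes, and `cpufreq_snapshot` and its `root` parameter are illustrative names. It only collects numbers; deciding whether they indicate throttling still needs the rest of the logs.

```python
import glob
import os

# Generic Linux cpufreq attributes (values are in kHz).
FIELDS = (
    "scaling_cur_freq",
    "scaling_min_freq",
    "scaling_max_freq",
    "cpuinfo_max_freq",
)


def cpufreq_snapshot(root="/sys/devices/system/cpu"):
    """Return {cpu_name: {field: int or None}} for every CPU with cpufreq."""
    snapshot = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq")
    for cpu_dir in sorted(glob.glob(pattern)):
        cpu = os.path.basename(os.path.dirname(cpu_dir))
        values = {}
        for field in FIELDS:
            try:
                with open(os.path.join(cpu_dir, field)) as f:
                    values[field] = int(f.read().strip())
            except (OSError, ValueError):
                # Attribute missing, unreadable, or not a number.
                values[field] = None
        snapshot[cpu] = values
    return snapshot


if __name__ == "__main__":
    for cpu, values in cpufreq_snapshot().items():
        print(cpu, values)
```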

Hi,

This is not a defect. If you were running in MAXN mode, this situation is probably caused by over-current: the system throttles its frequency to protect itself.

If this is unclear, I can explain it again in Chinese. It seems that some of the earlier replies were not fully understood.

Is MODE_20W_6CORE the MAXN mode? And can this mode occasionally trigger over-current protection?

Hi,

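I looked at your dmesg. Both the kernel and the dts have been modified. Was this issue reproduced on a custom board?
Have you tried to reproduce it on an NV devkit?
Your earlier /sys/class/hwmon/ listing appears to be missing some nodes. The state looks a bit odd.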

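Yes, the problem occurred on our own product. I checked the configuration, and the cause is probably that the following two options were not enabled. Can leaving them disabled cause the CPU frequency to drop?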

+CONFIG_TEGRA23X_OC_EVENT=y
+CONFIG_TEGRA19X_OC_EVENT=y
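
To confirm on a running unit whether these two options actually made it into the kernel, a check like the following might help. It is a sketch under one assumption: that the kernel was built with CONFIG_IKCONFIG_PROC so that /proc/config.gz exists. If it does not, the same `config_value` helper (a hypothetical name) can be pointed at the text of the .config used for the build.

```python
import gzip

OPTIONS = ("CONFIG_TEGRA19X_OC_EVENT", "CONFIG_TEGRA23X_OC_EVENT")


def config_value(config_text, name):
    """Return the value of a kernel config option, or None if it is not set.

    Lines such as "# CONFIG_FOO is not set" count as not set.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    return None


def load_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (needs CONFIG_IKCONFIG_PROC)."""
    with gzip.open(path, "rt") as f:
        return f.read()


if __name__ == "__main__":
    try:
        text = load_running_config()
    except OSError as exc:
        print("cannot read kernel config:", exc)
    else:
        for option in OPTIONS:
            print(option, "=", config_value(text, option))
```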

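This issue also occurs only with very low probability on our product. Most units show no similar problem. It would probably be hard to reproduce on the NVIDIA EVB board as well.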

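Please enable them first. This is not something you should turn off yourself. We cannot be sure how the device behaves after a change like that…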

OK. Because of a functional requirement, we have one USB modification in the device tree:

+       xusb_padctl@3520000 {
+               ports {
+                       usb2-0 {
+                               mode = "host";
+                               status = "okay";
+                       };
+               };
+       };

After this modification, /sys/class/hwmon/hwmon5 disappears while the other directories remain. Does this carry any risk? Thank you.

Which path did your "hwmon5" originally link to?

Before the modification:

~$ ls -l /sys/class/hwmon/
total 0
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon0 -> ../../devices/virtual/thermal/thermal_zone5/hwmon0
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon1 -> ../../devices/platform/d280000.soctherm-oc-event/hwmon/hwmon1
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon2 -> ../../devices/platform/39c0000.tachometer/hwmon/hwmon2
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon3 -> ../../devices/platform/pwm-fan/hwmon/hwmon3
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon4 -> ../../devices/platform/3520000.xusb_padctl/usb2-0/3520000.xusb_padctl:ports:usb2-0:connector/power_supply/usb-charger/hwmon4
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon5 -> ../../devices/platform/c250000.i2c/i2c-7/7-0040/hwmon/hwmon5

After the modification:

~$ ls -l /sys/class/hwmon/
total 0
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon0 -> ../../devices/virtual/thermal/thermal_zone5/hwmon0
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon1 -> ../../devices/platform/d280000.soctherm-oc-event/hwmon/hwmon1
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon2 -> ../../devices/platform/39c0000.tachometer/hwmon/hwmon2
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon3 -> ../../devices/platform/pwm-fan/hwmon/hwmon3
lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon4 -> ../../devices/platform/c250000.i2c/i2c-7/7-0040/hwmon/hwmon4

What disappeared is not hwmon5.

The entry below is the one that is gone, so the original hwmon5 is now simply enumerated as hwmon4. The original functionality is not affected.

lrwxrwxrwx 1 root root 0 Sep  8 05:58 hwmon4 -> ../../devices/platform/3520000.xusb_padctl/usb2-0/3520000.xusb_padctl:ports:usb2-0:connector/power_supply/usb-charger/hwmon4
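
Because the hwmonN index can shift like this, it is safer to compare listings by device path than by index. A small sketch of that comparison follows. It assumes `ls -l` style input as pasted above, and the names `devices_by_path` and `missing_devices` are made up for illustration.

```python
import re

# Captures "hwmonN -> target" from one line of `ls -l /sys/class/hwmon/`.
_LINK = re.compile(r"(hwmon\d+) -> (\S+)")


def devices_by_path(listing):
    """Map each device path (without the trailing hwmonN) to its hwmon name."""
    devices = {}
    for name, target in _LINK.findall(listing):
        # Drop the final "/hwmonN" component so renumbering does not matter.
        device = target.rsplit("/", 1)[0]
        devices[device] = name
    return devices


def missing_devices(before, after):
    """Return device paths present in `before` but absent from `after`."""
    return sorted(set(devices_by_path(before)) - set(devices_by_path(after)))
```

Fed the two listings from this thread, `missing_devices(before, after)` should report only the usb-charger path under 3520000.xusb_padctl, while the 7-0040 device on i2c-7 shows up in both listings under different hwmon names.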

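OK. Where is this frequency reduction carried out? Is there a user-space service that lowers the frequency on its own because it could not detect the OC state?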

Please turn the oc event back on first, and then we can discuss…