Fan/temperature trip level issue

I’m seeing some very curious behavior regarding the fan and temperature trip levels: a very reproducible system “hesitation” whenever these trip levels are hit. The OS seems to freeze for close to a second before continuing as normal. I am running a Xavier AGX on a rogue board with JetPack 4.3.
I am running power mode 0 with jetson_clocks running, so I am unsure what is actually happening in this situation:

nvidia@XAVIER:~$ sudo jetson_clocks --show
SOC family:tegra194 Machine:jetson-xavier
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=1 c6=1
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=1 c6=1
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=1 c6=1
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=1 c6=1
cpu4: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=1 c6=1
cpu5: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=1 c6=1
cpu6: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=1 c6=1
cpu7: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=1 c6=1
GPU MinFreq=318750000 MaxFreq=1377000000 CurrentFreq=1236750000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=0
Fan: speed=255
NV Power Mode: MAXN
nvidia@ROGUE-RENO-B-AGX-43:~$ sudo nvpmodel -q
NV Fan Mode:quiet
NV Power Mode: MAXN
0

I see these entries in /var/log/syslog, which correlate with the moments when the system freezes for almost a second:

Aug 26 12:18:27 XAVIER kernel: [1018963.739263] FAN cooling trip_level:3 cur_temp:72900 trip_temps[4]:81000
Aug 26 12:18:59 XAVIER kernel: [1018995.263747] FAN cooling trip_level:2 cur_temp:63900 trip_temps[3]:72000

Can anyone explain what is happening and how to possibly address it?

hello peterqissel,

You’re running in MAXN power mode. May I know your use case, or which application you ran to trigger this issue?
BTW, please refer to Topic 129065 for thermal zone customization in the device tree.
thanks

Thanks for the link, I will check that out and see if it will help solve our issue.

We are processing multiple hi-res video feeds, feeding them into machine learning algorithms, and processing the results. It takes a lot of CPU and a lot of memory.

One of my biggest questions is why the trip levels have any effect at all, given that the device is already in power mode 0 and jetson_clocks is running the fan at full speed (255). And in any case, why would hitting one of these trip levels cause a performance degradation that appears to freeze the system?

hello peterqissel,

May I know your use case, how many cameras you are using, and at what resolution?
The performance degradation you are seeing is likely caused by hardware throttling. You may also run the tegrastats utility to monitor memory and processor usage.
thanks
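
For watching temperatures alongside tegrastats, here is a minimal sketch that reads the standard Linux thermal sysfs entries. It assumes the zones are exposed under /sys/class/thermal with `type` and `temp` files, and that `temp` is in millidegrees Celsius (the usual kernel convention). The function name `read_thermal_zones` is my own, and the zone names you get back depend on the board.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_type: temperature_in_celsius} for each readable thermal zone.

    Assumes the usual sysfs layout: <base>/thermal_zoneN/type and
    <base>/thermal_zoneN/temp, with temp reported in millidegrees Celsius.
    """
    zones = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            # Some zones may be unreadable or report no valid value; skip them.
            continue
        zones[name] = millideg / 1000.0
    return zones


if __name__ == "__main__":
    for name, celsius in read_thermal_zones().items():
        print(f"{name}: {celsius:.1f} C")
```

Polling this once a second while the workload runs should show whether any zone is actually near a throttling threshold when the freeze happens.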

We are currently trying to process 5 video feeds of 4096 x 2160 at 15 fps.
I’m not seeing any indication that hardware throttling is occurring: the temperatures are not that high, and the issue only lasts for approximately 1 second.
I have been attempting to duplicate the issue with a simpler situation and so far have not reproduced the issue, but I have seen some interesting things in the logs.

[ 2108.263065] FAN rising trip_level:1 cur_temp:50000 trip_temps[2]:63000
[ 3327.894143] FAN rising trip_level:2 cur_temp:63000 trip_temps[3]:72000
[ 4646.066912] FAN cooling trip_level:1 cur_temp:54950 trip_temps[2]:63000
[ 7805.435860] FAN rising trip_level:2 cur_temp:63000 trip_temps[3]:72000
[ 9129.212917] FAN cooling trip_level:1 cur_temp:54950 trip_temps[2]:63000
[13705.317626] FAN rising trip_level:2 cur_temp:63000 trip_temps[3]:72000
[15854.495990] FAN rising trip_level:1 cur_temp:59000 trip_temps[2]:63000
[15858.975771] FAN rising trip_level:1 cur_temp:58800 trip_temps[2]:63000

Those logs end with three consecutive “rising” entries, even though the temperature is falling on the last two.
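
To pick these out of a long log automatically, here is a rough sketch that parses the FAN lines and flags any “rising” entry whose trip level or temperature is lower than in the previous entry. The regular expression is based only on the line format shown above, and the helper names (`parse_fan_events`, `suspicious_rising`) are made up for illustration.

```python
import re

# Matches kernel lines such as:
#   FAN rising trip_level:2 cur_temp:63000 trip_temps[3]:72000
FAN_RE = re.compile(
    r"FAN (?P<direction>rising|cooling) trip_level:(?P<level>\d+) "
    r"cur_temp:(?P<cur>\d+) trip_temps\[(?P<idx>\d+)\]:(?P<trip>\d+)"
)

SAMPLE = """\
[ 2108.263065] FAN rising trip_level:1 cur_temp:50000 trip_temps[2]:63000
[ 3327.894143] FAN rising trip_level:2 cur_temp:63000 trip_temps[3]:72000
[ 4646.066912] FAN cooling trip_level:1 cur_temp:54950 trip_temps[2]:63000
[ 7805.435860] FAN rising trip_level:2 cur_temp:63000 trip_temps[3]:72000
[ 9129.212917] FAN cooling trip_level:1 cur_temp:54950 trip_temps[2]:63000
[13705.317626] FAN rising trip_level:2 cur_temp:63000 trip_temps[3]:72000
[15854.495990] FAN rising trip_level:1 cur_temp:59000 trip_temps[2]:63000
[15858.975771] FAN rising trip_level:1 cur_temp:58800 trip_temps[2]:63000
"""


def parse_fan_events(lines):
    """Extract FAN trip events from kernel log lines, ignoring other lines."""
    events = []
    for line in lines:
        m = FAN_RE.search(line)
        if m:
            events.append({
                "direction": m.group("direction"),
                "level": int(m.group("level")),
                "cur_temp": int(m.group("cur")),
                "next_trip": int(m.group("trip")),
            })
    return events


def suspicious_rising(events):
    """Return 'rising' events whose level or temperature dropped vs. the previous event."""
    flagged = []
    for prev, cur in zip(events, events[1:]):
        if cur["direction"] == "rising" and (
            cur["level"] < prev["level"] or cur["cur_temp"] < prev["cur_temp"]
        ):
            flagged.append(cur)
    return flagged


if __name__ == "__main__":
    for event in suspicious_rising(parse_fan_events(SAMPLE.splitlines())):
        print(event)
```

On the excerpt above this flags the last two entries. To run it on a live system, feed it the lines of /var/log/syslog or the output of dmesg instead of SAMPLE.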

hello peterqissel,

We cannot guarantee stability for your use case, since it is not included in the Software Features.
FYI, please refer to the developer guide and check CSI and USB Camera Features; 4K preview is validated with two cameras.
thanks

Thank you for your informative reply, I was not aware of those specific use cases. I was operating under a more general assumption:
3840 x 2160 x 60 x 2 = 995,328,000 (in the link you sent me), this is 2 4K 60 fps video feeds.
1920 x 1440 x 30 x 6 = 497,664,000 (in the link you sent me), this is 6 30 fps video feeds

What we are attempting:
4096 x 2186 x 14 x 5 = 626,769,920 - this is the resolution of our cameras at 14 fps, 5 video feeds. Since the total throughput was between the two officially supported use cases mentioned above, we assumed the system would be able to handle the total throughput of what we are attempting.
It sounds like you are saying we cannot rely on the system handling the total throughput of this particular configuration, since it differs from the officially supported use cases mentioned above.

Would that be an accurate statement?
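
For reference, the arithmetic above can be reproduced with a few lines of Python. This is only the aggregate pixel rate (width x height x fps x number of streams); it is a simplification that says nothing about per-stream overhead, and the helper name `pixel_rate` is made up for this post.

```python
def pixel_rate(width, height, fps, streams):
    """Aggregate pixels per second across all streams."""
    return width * height * fps * streams


validated_dual_4k = pixel_rate(3840, 2160, 60, 2)   # two 4K feeds at 60 fps
validated_six_cam = pixel_rate(1920, 1440, 30, 6)   # six feeds at 30 fps
attempted = pixel_rate(4096, 2186, 14, 5)           # our five feeds at 14 fps

print(f"dual 4K@60:      {validated_dual_4k:,}")
print(f"six 1920x1440:   {validated_six_cam:,}")
print(f"our five feeds:  {attempted:,}")
print("between the two validated cases:",
      validated_six_cam < attempted < validated_dual_4k)
```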

hello peterqissel,

It’s more complicated for the multi-camera use case: there is an instance for each camera stream, with EGLStream rendering the preview frames. Even though your throughput is below dual 4K@60 preview, we have only validated six-camera preview at 1920x1440@30 fps.