Xavier Fan Trouble (re-enable DVFS)

I’m having fan trouble on the Xavier Developer Kit: at some point the fan started running at full speed even though temperatures, as measured by jtop (from rbonghi/jetson_stats on GitHub), were not high.
I have no explanation for why it started doing so, although a few times I noticed CPU1 sitting at 100% for long periods without any process in top showing high CPU usage.
Rebooting did not solve the issue.

I’m running an AGX Xavier Developer Kit with JetPack 4.3 [L4T 32.3.1], set to nvpmodel -m 0.
Let me recap what I (think I) have learned so far:

Running /usr/bin/jetson_clocks sets /sys/devices/pwm-fan/target_pwm to the FAN_SPEED value (a number between 0 and 255) and sets /sys/devices/pwm-fan/temp_control to 0 (probably to disable automatic adjustment of fan speed based on temperature?).
When the FAN_SPEED variable in jetson_clocks is 255, the fan does indeed speed up to its maximum.

When I set /sys/devices/pwm-fan/target_pwm to a lower value I can indeed see the fan slow down, but it stays at that speed regardless of temperature.

I was hoping that resetting /sys/devices/pwm-fan/temp_control back to 1 and /sys/devices/pwm-fan/target_pwm to 0 would bring back automatic fan speeds based on temperature (DVFS), but unfortunately that does not seem to be the case: now the fan never turns on, even at high temperatures, up to the point where the Xavier shuts down automatically.
I also tried to rerun nvpmodel -m 0, with the same result of the fan never turning on anymore.

I also can’t do a jetson_clocks --restore because I did not do a --store before running jetson_clocks.

For now I have a temporary solution: I set /sys/devices/pwm-fan/target_pwm to 102 so the fan runs permanently at 40%, which is enough most of the time, but every now and then I have to set it higher and then lower again.
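The 102 above is simply 40% of the 0 to 255 range. A minimal sketch of that conversion follows. The function names and the FAN_DIR variable are made up for illustration, and the default directory is the sysfs path used in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; the names are not part of any NVIDIA tool.

# Convert a duty-cycle percentage (0-100) to the 0-255 PWM scale, rounded.
pct_to_pwm() {
    pct=$1
    if [ "$pct" -lt 0 ] || [ "$pct" -gt 100 ]; then
        echo "percentage must be between 0 and 100" >&2
        return 1
    fi
    echo $(( (pct * 255 + 50) / 100 ))
}

# Write the converted value to the fan's target_pwm file.
# FAN_DIR is an assumption; it defaults to the directory used in this thread.
set_fan_pct() {
    fan_dir=${FAN_DIR:-/sys/devices/pwm-fan}
    pwm=$(pct_to_pwm "$1") || return 1
    echo "$pwm" > "$fan_dir/target_pwm"
}
```

From a root shell, `set_fan_pct 40` would write 102. Like the manual workaround, this fixes the speed and does not react to temperature.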

Is there any way for me to re-enable DVFS so the fan speed automatically adjusts to the temperatures?

hello herman.jansen,

It’s true that enabling jetson_clocks sets the fan speed to maximum.
Please refer to the Fan Mode Control chapter of the developer guide for fan mode configuration.
thanks

Hi herman.jansen,

I don’t think your operation would affect the DVFS table. Would you mind re-flashing your board to see if the issue still occurs?
If so, could you share the tegrastats when the fan automatically starts?

If you look at the actual “jetson_clocks” script, it is human-readable bash. The “do_fan()” function could be edited, or simply used to create a different script (if you edit it, don’t forget to save a copy of the original too). The fan speed is just an echo into a “/sys” file, where 255 is maxed out (and yes, this is unrelated to DVFS). The “auto” setting is “0”, the “max” setting is “255”, and anything in between is exactly what it would seem to be. Check this before and after running “jetson_clocks”:

sudo cat /sys/devices/pwm-fan/target_pwm

Auto would probably be fine, but if there is a temperature throttle, then DVFS would still have a momentary reduction until the fan speeds up and cools the system down again. Can you tolerate a momentary throttle? If not, then you want 255 (max), but if you can tolerate the lag between heating up and the fan cooling things back down, then 0 (auto) is good enough.
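To make the before/after comparison easier, the two fan files mentioned in this thread can be printed together. This is only a sketch: fan_state and FAN_DIR are hypothetical names, and the default path is the one quoted above.

```shell
#!/bin/sh
# Print "file:value" for the two fan control files, similar to grep's output.
# FAN_DIR is an assumption; it defaults to the directory used in this thread.
fan_state() {
    fan_dir=${FAN_DIR:-/sys/devices/pwm-fan}
    for f in target_pwm temp_control; do
        if [ -r "$fan_dir/$f" ]; then
            printf '%s:%s\n' "$f" "$(cat "$fan_dir/$f")"
        else
            printf '%s:unreadable\n' "$f"
        fi
    done
}
```

One way to use it from a root shell: run `fan_state > before.txt`, run jetson_clocks, run `fan_state > after.txt`, then compare the two files with diff.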

Thanks for the help. It’s a friendly neighbourhood over here.

I’m really starting to doubt myself.
After reading the documentation and applying the following settings:

  • nvpmodel -q output: NV Fan Mode:quiet NV Power Mode: MAXN 0
  • /sys/devices/pwm-fan/temp_control is set to 1
  • /sys/devices/pwm-fan/target_pwm is set to 0

and putting the Xavier under load until the average temperature of the different temperature sensors goes over 50C, now the fan turns on again like it should.
I’m pretty sure I did the same thing before with no effect, but as I said, I’m starting to doubt myself.
So, for the moment my issue is solved. Thanks again for the help.
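The two sysfs settings from the list above could be applied in one step, roughly as follows. This is a sketch under the assumptions of this post (temp_control=1 and target_pwm=0 meaning automatic control). fan_auto and FAN_DIR are hypothetical names, and the nvpmodel fan mode is not touched here.

```shell
#!/bin/sh
# Write the values this post reports as restoring automatic fan control.
# FAN_DIR is an assumption; it defaults to the directory used in this thread.
fan_auto() {
    fan_dir=${FAN_DIR:-/sys/devices/pwm-fan}
    echo 1 > "$fan_dir/temp_control" || return 1
    echo 0 > "$fan_dir/target_pwm" || return 1
}
```

It needs a root shell on the device. Whether the fan then follows the temperature is exactly what is in question in this thread, so it is worth checking under load afterwards.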

I can only guess, but being able to control the fan is useful only when some other program is looking at the current temperature and deciding what speed the fan should run at. If you were to enable auto fan, but in some way ignore or disable the program which wants to set a fan speed, then it would fail. Perhaps something decided that if the fan was on max, then temp_control should also run differently. I have not examined the temp_control changes, nor their relation to the jetson_clocks script.

I don’t know what is going on with temperature management on my Xavier.

I left it running overnight, with nothing active other than JupyterLab and no processes really running.
When I wanted to continue work today I saw that the Xavier had panicked (CPU1 unresponsive for 21 seconds) and become unresponsive, and it felt very hot from the outside (even though the panic happened somewhere during the night).

When I rebooted, the fan came on and the Xavier started to cool down again. It must have failed to turn on the fan during the night, otherwise it would not have become so hot, even though the test I did yesterday showed the fan coming on when I put a heavier load on the Xavier.

This looks like the problem I described in my original post, where I noticed CPU1 going to 100% for long periods without top showing any process using the CPU.

Unless someone has an idea about what could be going wrong, I suppose I will have to reflash and reconfigure from scratch…
and never leave it switched on unattended, because I don’t know what damage these high temperatures cause.

If you first run “nvpmodel -m 0”, but do not run “jetson_clocks” (nor anything for fan adjustment), then what do you see from:

sudo -s
cd /sys/
grep -i '.*' kernel/debug/tegra_fan/* devices/thermal-fan-est/temps

(note that “grep” will give the name of the file being monitored, so it is more convenient than “cat” when monitoring several files)

Then run jetson_clocks, followed by the same command. Post the new output under jetson_clocks (wait about 30 seconds after running jetson_clocks before you run the second grep). This should give a bit more detail on what the fan is being told to run.

FYI, if you were to log in via serial console, then you could run this and the final output would be available (visible on the PC with the serial console app) even after the system locks up:

sudo -s
cd /sys/
watch -n 1 "grep -i '.*' kernel/debug/tegra_fan/* devices/thermal-fan-est/temps"

It isn’t a good idea to purposely run a unit till it locks up due to temperature, but if you are debugging and it is going to do this anyway, then having data is a good idea. “grep” plus “watch -n 1” over serial console will make sure the data isn’t lost.
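If no serial console is at hand, a rough alternative is to append timestamped readings to a file and flush after each pass. This is only a sketch: log_fan_temps and its arguments are hypothetical, the default source path is the temps file from the grep command above, and a file on the device may still lose its last lines in a hard lockup, which is why the serial console remains the safer option.

```shell
#!/bin/sh
# Append timestamped copies of a temperature file to a log.
# Arguments (all optional, all assumptions of this sketch):
#   $1 source file, $2 log file, $3 number of passes (0 = until interrupted),
#   $4 seconds between passes.
log_fan_temps() {
    src=${1:-/sys/devices/thermal-fan-est/temps}
    out=${2:-fan_temps.log}
    count=${3:-0}
    interval=${4:-1}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # Prefix every line with seconds since the epoch.
        while IFS= read -r line; do
            printf '%s %s\n' "$(date +%s)" "$line"
        done < "$src" >> "$out"
        sync    # ask the kernel to flush buffered writes to disk
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, `log_fan_temps` with no arguments logs once per second to fan_temps.log in the current directory until interrupted.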

This is the grep output after a fresh reboot (thanks for the grep tip):

devices/thermal-fan-est/temps:[0] 25000 25500 25500 25500 25500 25500 25000 25500 25500 25000 25000 25500 25500 25000 25000 25500 25500 25000 25500 25000
devices/thermal-fan-est/temps:[1] 25000 25500 25500 25500 25500 25500 25000 25500 25500 25000 25000 25500 25500 25000 25000 25500 25500 25000 25500 25000
devices/thermal-fan-est/temps:[2] 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000 24000

And this is the output after running jetson_clocks and waiting for more than 30 seconds:

devices/thermal-fan-est/temps:[0] 29000 28500 28500 29000 28500 28500 28500 28500 28500 28500 28500 29000 28500 28500 28500 28000 28500 28500 28500 28500
devices/thermal-fan-est/temps:[1] 27000 27000 27000 27000 27000 27000 27000 26500 27000 27000 27000 27000 27000 27000 27000 27000 27000 27000 27000 27000
devices/thermal-fan-est/temps:[2] 27000 27000 27000 27000 26500 26500 27000 27000 27000 27000 27000 26500 27000 27000 26500 27000 27000 27000 27000 27000
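Rows like these are easier to compare as one average per row. The sketch below assumes the values are millidegrees Celsius, which matches the roughly 25 to 30C shown by jtop in this thread but is not confirmed here. avg_temps is a hypothetical name.

```shell
#!/bin/sh
# Average the numeric fields of each line and print "label average-in-C".
# Reads the named files, or standard input when no file is given.
# Assumes the first field is a label and the rest are millidegrees Celsius.
avg_temps() {
    awk '{
        sum = 0; n = 0
        for (i = 2; i <= NF; i++) { sum += $i; n++ }
        if (n > 0) printf "%s %.1f\n", $1, sum / n / 1000
    }' "$@"
}
```

It can take the grep output on standard input, for example `grep -i '.*' devices/thermal-fan-est/temps | avg_temps` from /sys. With the first reading above, the [2] row averages to 24.0.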

I will now reboot again, start JupyterLab like I did yesterday, and start the watch on a console. I will report back after I have left it running for another night.

The mystery continues.

I set temp_control back to 1 and target_pwm to 0.
When I now reboot the Xavier, the fan goes to 40% (target_pwm 102), while all temperature sensors read around 30C.
Something seems completely off, and it is starting to look random to me. I have no idea why it now automatically sets target_pwm to 102 when I set it to 0 right before the reboot.

[info]             [Sensor]   [Temp]               [Power/mW]   [Cur]   [Avr]
UpT: 0 days 0:5:46              AO         28.50C               CPU          465     497
FAN [||||||        40%] Ta= 40% AUX        28.50C               CV           0       0
Jetson Clocks: inactive         CPU        30.50C               GPU          0       0
NV Power[0]: MAXN               GPU        30.50C               SOC          2483    2482
APE: 150MHz                     PMIC      100.00C               SYS5V        3250    3250
HW engine:                      Tboard     30.00C               VDDRQ        775     775
 ENC: NOT RUNNING               Tdiode     31.50C               Total        6973    7004
 DEC: NOT RUNNING               thermal    29.70C
devices/thermal-fan-est/temps:[0] 30500 30500 30500 30000 30000 30000 30000 30000 30500 30000 30500 30500 30500 30000 30000 30000 30500 30500 30500 30000
devices/thermal-fan-est/temps:[1] 30500 30500 30500 30000 30000 30000 30000 30000 30500 30000 30500 30500 30500 30000 30000 30000 30500 30500 30500 30000
devices/thermal-fan-est/temps:[2] 28500 28500 28500 28500 28500 28500 28500 28500 28500 28500 28500 28500 29000 28500 28500 28500 28500 28500 28500 29000

Hi herman,

I have the same issue, and not only on the Xavier but also on the Nano.

After jetson_clocks sets

/sys/devices/pwm-fan/temp_control=0

and I then manually set

/sys/devices/pwm-fan/temp_control=1

the OS doesn’t seem to take the setting over again: when the temperature rises, the OS doesn’t turn on the fan (only when jetson_clocks is enabled).

Honestly, I don’t know the reason.

I will follow this discussion; maybe we can find a way.

Raffaello

Hi Raffaello,

I’m glad I’m not the only one.
I also thought the OS would take over again after setting temp_control to 1, but it looks like I’m getting some random behaviour now.
At the moment I’m completely puzzled.

I’m sure however with some help we can get it solved.

Are you using the fan controller on jtop in jetson-stats package? (page 4 - CTRL)

Hi Raffaello,

I just noticed you are the author of the jtop tool. Nice to meet you, and thank you for the tool. I use it a lot to monitor resource usage.

I have used the fan control before to change the fan speed manually when I noticed in jtop that the temperatures were going up without the fan speeding up automatically. I have not used it since I changed the values in temp_control and target_pwm. I suppose changing the manual speed in jtop does the same as setting the speed in target_pwm.

I also think that I used the ‘a’ option at some point. It looks like it was the same as running jetson_clocks. I’m not sure what the ‘e’ option does, or what ‘CTRL=Enable’ means.

Thank you, I’m glad jetson-stats is useful! :-)

When you press the ‘f’ key you can change the fan control mode. When it reads ‘manual’ you set a manual speed, and the speed you set is applied again when your board restarts.
To do that, the selected mode and the chosen speed are stored in a file, /opt/jetson_stats/fan_config. (There is also a jetson_fan service that reads this file and enables the fan if required.)

If it reads Jc after pressing ‘f’, the fan is controlled by jetson_clocks and no other service changes the speed of your fan.

To make sure that this service is not running and that this file does not exist on your board, run:

sudo jtop --restore
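Before or after the restore, it may help to check whether a saved fan configuration is still present. The sketch below only looks at the file path named in this post. check_jtop_fan_config is a hypothetical name, and nothing is assumed about the file’s format.

```shell
#!/bin/sh
# Report whether a saved jetson_stats fan configuration exists.
# The default path is the one named in this thread; pass another path to override.
check_jtop_fan_config() {
    cfg=${1:-/opt/jetson_stats/fan_config}
    if [ -e "$cfg" ]; then
        echo "found $cfg:"
        cat "$cfg"
        return 0
    fi
    echo "no saved fan configuration at $cfg"
    return 1
}
```

A return status of 0 means the file is still there. Checking the jetson_fan service itself (for example with systemctl, assuming it is a systemd unit of that name) would be a separate step.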

The ‘a’ command activates the jetson_performance service, which manages the jetson_clocks script, and the ‘e’ command enables jetson_clocks to run at boot.

Some time ago I tried to enable the temperature control using:

/sys/devices/pwm-fan/temp_control=1

but, like you, I found it doesn’t work, and I don’t know why.

I also have an open issue on my repo: What's the difference among the FAN f "Jc", "Auto" and "Manual" · Issue #41 · rbonghi/jetson_stats · GitHub

After running jtop --restore the fan turns off and stays off after a reboot.
When I put load on the Xavier after the reboot, the fan turns on when the average temperature goes over 50C, and it turns off again when the board cools down.

Thanks already for the jtop --restore tip.
It looks like I’m back to normal behaviour, although I have to do some more testing to see whether I run into any of the other issues I noticed before.

Is it possible that jetson_stats and the OS fan control are fighting each other under certain conditions? Would you prefer to continue the investigation on your repo?

Hi herman.jansen and rbonghi,

Sorry for the late reply. Could you share a brief conclusion about what you’ve done here?

I could try to reproduce it with my Xavier. AFAIK, it looks like we need to set nvpmodel to mode 0 and toggle target_pwm and temp_control. When hitting this problem, we should see the fan not working at all, right?

Hi Wayne,

I don’t have a definitive conclusion yet. After the golden tip from rbonghi to do a sudo jtop --restore, it does indeed look to me like setting nvpmodel to 0, target_pwm to 0 and temp_control to 1 switches back to automatic fan control, based on the different temperature sensors.

I can be completely wrong here, but I’m under the impression, and I hope rbonghi can shed some more light on this, that when you use the fan control options in the jetson_stats jtop tool, it implements also some kind of fan control which maybe can interfere with the OS fan control in some way.

Maybe the thing I did wrong was to use jtop fan control, and also manually set temp_control to 1 which might trigger the OS fan control to become active also, while jtop fan control is still active, leading to some random behaviour because of the 2 fan control systems being active at the same time. I’m purely guessing here. I would have to do some more testing.
As I said before, maybe rbonghi has more insights on what exactly happens when you start using the fan control options in jtop.

Hi herman.jansen,

jetson_stats and jtop should use the same interfaces that the thermal/fan driver exposes to userspace, both to show this information and to control the fan.

@rbonghi, could you confirm?

IMO, if you set temp_control to 1, the fan should speed up automatically when the device gets hot. Please help me check whether the fan fails to work under this setting.

Hi WayneWWW

good morning!

Yes, exactly: jtop controls the fan through the same interfaces that the thermal/fan driver exposes to userspace.

When I made my wrapper, I tested the combination of:

  • jetson_clocks
  • /sys/devices/pwm-fan/temp_control=1

but it doesn’t work. I remember the temperature getting high without any kind of management from the OS.

My original idea was to make a button in jtop to set temp_control=1 again. If it works, I will re-enable it.