My situation before running jtop --restore was that, although I had set temp_control to 1 and target_pwm to 0, the fan went to 40% after every reboot of the Xavier and stayed there. (40% was a value I had set at some point through manual control in jtop.)
After running jtop --restore the behaviour went back to normal: the fan stays off after a reboot and starts when temperatures get higher.
This led me to conclude that jtop somehow overrides the default OS fan control, but I could be completely wrong of course.
Does that help in any way to investigate what is happening?
On a side note: is there any condition under which jtop could get into a loop and permanently push CPU1 to 100%? This is what led me to start experimenting with the fan control options. (I can’t reproduce it right now, however.)
Honestly, I have never noticed that. I have tried different NVIDIA Jetson boards and haven’t seen anything like it, but I will do an extra check over the next few days.
There is only one condition, and it requires changing the default configuration of jtop with: jtop -r 100 (or another very low number). In that case jtop runs tegrastats at a much higher sampling rate (tegrastats usually works at 500ms), and you can see one CPU being overloaded.
Anyway, does your board work fine now?
Can you test this configuration for me:
jetson_clocks
/sys/devices/pwm-fan/target_pwm=0
/sys/devices/pwm-fan/temp_control=1
Apply a load
Check whether the fan starts automatically at high temperature
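For anyone who wants to script the two sysfs writes from the steps above, here is a minimal sketch in Python. The directory is the one quoted in the list; the function name `apply_fan_test_config` is just a name picked for this example, and the sketch assumes both files accept a plain number written as text. jetson_clocks still has to be run separately beforehand, and writing under /sys needs root.

```python
from pathlib import Path

# Fan control directory as quoted in the steps above.
FAN_DIR = Path("/sys/devices/pwm-fan")


def apply_fan_test_config(fan_dir=FAN_DIR):
    """Write target_pwm=0 and temp_control=1, then read both values back."""
    settings = {"target_pwm": "0", "temp_control": "1"}
    readback = {}
    for name, value in settings.items():
        node = Path(fan_dir) / name
        node.write_text(value)                   # same effect as: echo 0 > target_pwm
        readback[name] = node.read_text().strip()
    return readback


# On the Jetson (as root, after jetson_clocks):
#     print(apply_fan_test_config())
```

Reading the values back right after writing them makes it easy to see whether something else changed them again before the load test starts.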
@rbonghi I have done the sequence you asked for and put some load on the Xavier, and the fan turns on automatically when avg. temperatures reach 50C.
I have also given the Xavier time to cool down, and I can confirm the fan switches off when avg. temperatures drop below 32C.
It seems in my case I’m back to a normal situation, and that it was solved by running jtop --restore.
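The on/off behaviour reported above (fan on at 50C, off below 32C) looks like a two-threshold hysteresis. The sketch below only models that observation; it is not the actual thermal driver logic, and the function name and the temperature sequence are made up for illustration.

```python
def fan_state(temp_c, fan_on, on_at=50.0, off_below=32.0):
    """Next fan state for a simple two-threshold (hysteresis) model."""
    if temp_c >= on_at:
        return True
    if temp_c < off_below:
        return False
    return fan_on  # between the thresholds the fan keeps its current state


# Walk through one heat-up and cool-down cycle (invented temperatures).
state = False
history = []
for temp in [30, 45, 50, 55, 40, 33, 31]:
    state = fan_state(temp, state)
    history.append((temp, state))
```

In this model the fan stays off at 45C while heating up but stays on at 40C while cooling down, which matches a fan that only starts at 50C and only stops below 32C.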
I’m wondering if perhaps nvpmodel changes allowed ranges such that you are seeing this in the “/sys” files…don’t know. If there is a new minimum for certain values under different models, then I might expect the fan could be part of that (purely speculation).
In the last few days I again had 2 occasions where CPU1 usage went up to 100% and the Xavier heated up without the fan starting, until the Xavier shut down with a panic.
The last time nothing was running (it was idle after a reboot, with just an ssh session connected).
After the last panic shutdown and reboot, CPU1 went again to 100% but then gradually started to go down to 0%, without any processes in htop showing abnormal CPU usage.
I now have only jtop and htop running. I hardly see any load on the CPUs (all < 5%) or the GPU.
Nevertheless the temperature is slowly going up.
After about 30 minutes CPU1 slowly starts ramping up (going from 0% to 100% in around 10 minutes), with no processes in htop showing additional CPU usage. (see attached jtop output)
And again a panic reboot :-(
Any ideas what I can do to find out what is going on?
My last resort is completely reflashing the Xavier…
And after the panic reboot the fan now turns on, because avg. temperatures were still above 50C after the reboot.
After cooling down to under 32C, the whole cycle repeats again (heating up → CPU1 100% → panic)
Maybe you missed my post (the one with the jtop screen dumps) where I mentioned that I also use htop to check whether any processes are misbehaving.
Do I have to use any htop command line options to see more details?
I have tried again, now also with K option enabled in htop, and I get the same behaviour: temp going up, cpu1 going to 100% in jtop and in htop, nothing special to see in any of the processes in htop.
The cpu1 usage going up happens around avg. temp of 45C, so before the fan can spin up.
Is it possible that the clock speed of CPU1 is throttled down, so that what ran at 5% before now shows up as a relatively higher percentage?
(jtop still shows CPU speed above 2.1GHz)
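To put numbers on the throttling question, here is a small sketch of the arithmetic. It assumes the background work needs a fixed number of cycles per second, so the reported percentage scales inversely with the clock. The 105 MHz figure is purely hypothetical, chosen to show how far the clock would have to drop.

```python
def apparent_load(load_pct, nominal_mhz, throttled_mhz):
    """Load percentage the same work would show at a lower clock, capped at 100."""
    if throttled_mhz <= 0:
        raise ValueError("clock must be positive")
    return min(100.0, load_pct * nominal_mhz / throttled_mhz)


unthrottled = apparent_load(5, 2100, 2100)  # 5.0
throttled = apparent_load(5, 2100, 105)     # 100.0
```

Under that assumption a 5% load only reads as 100% if the clock falls to about 1/20 of 2.1GHz, so throttling would not explain it while jtop still reports the core above 2.1GHz.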
Unless someone has an idea, I’m afraid I’m going to have to reflash because I need the Xavier to get some work done.
And now I have the same problem of the Xavier heating up without any load, CPU1 going to 100% at a certain moment in time, and a panic crash.
These steps should not drive the CPU to 100%… I guess we have two issues here.
For the fan issue, I ran a small test on my Xavier yesterday. It seems the combination of temp_control=1 + target_pwm=0 does not make the fan auto-start when the temperature goes up. I need to check this behavior with our thermal driver expert. For now, please do not use this setting.
As for the unknown workload on cpu1, are you able to observe anything in jetson-stats or tegrastats? What are the temperatures when the system panics?
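One way to look for the CPU load in tegrastats output is to pull the per-core values out of each line. The sketch below assumes the CPU field looks like `CPU [3%@2265,off,...]` (load%@MHz per core, `off` for disabled cores); check this against your own output, since the format can differ between releases. The sample line and the function name are invented for illustration.

```python
import re


def parse_cpu_field(line):
    """Per-core load in percent from one tegrastats line; None for cores that are off."""
    match = re.search(r"CPU \[([^\]]*)\]", line)
    if match is None:
        return []
    loads = []
    for core in match.group(1).split(","):
        core = core.strip()
        loads.append(None if core == "off" else int(core.split("%")[0]))
    return loads


# Hypothetical, shortened sample line; the values are not real measurements.
sample = "RAM 2448/15692MB CPU [3%@2265,100%@2265,2%@2265,off] GPU@45C"
loads = parse_cpu_field(sample)
# Zero-based positions of cores at or above 90% load.
busy = [i for i, load in enumerate(loads) if load is not None and load >= 90]
```

The positions in `busy` are list indices only; how they map to the CPU labels shown in jtop is not assumed here.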
Thanks for looking into this.
I will hold off on replacing my Xavier until you have had the chance to talk to the thermal driver expert.
To me it looks like the temp_control=1 + target_pwm=0 setting works sometimes, and sometimes it does not.
The thing that really worries me is that I had this behaviour right after reflashing…
I have now set the Xavier to ‘cool’ fan mode and I’m currently testing what happens when I put the Xavier under load.
The fan did turn on at 30% with temp_control=1 + target_pwm=0, and when avg. temp goes above 50C the fan spins up to 47%; at 47% temperatures seem to stabilise under the current load. So this looks like normal behaviour under load.
I will let it run like this for another hour. After this test I will let it cool down, reboot it, and not put any load on it, to see whether I still have this behaviour of gradually getting hotter and CPU1 going to 100% while doing nothing.
I will try to log tegrastats to a file, and send it to you when I get a panic.
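A possible way to do that logging so the last samples survive a panic is to flush and sync every line as it is written. This is only a sketch: the function names and the log path are made up, and it assumes tegrastats is on the PATH and accepts `--interval` in milliseconds.

```python
import os
import subprocess
import time


def log_lines(lines, path):
    """Append each line with a timestamp and force it to disk immediately."""
    with open(path, "a") as log:
        for line in lines:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {line.rstrip()}\n")
            log.flush()
            os.fsync(log.fileno())  # push the sample to disk before the next one


def log_tegrastats(path, interval_ms=1000):
    """Run tegrastats and log its output until it exits or is interrupted."""
    proc = subprocess.Popen(
        ["tegrastats", "--interval", str(interval_ms)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    try:
        log_lines(proc.stdout, path)
    finally:
        proc.terminate()


# On the Jetson:
#     log_tegrastats("tegrastats_panic.log")
```

Syncing after every line costs some I/O, but with one sample per second that should be negligible, and it keeps the readings from just before the shutdown.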