Xavier Fan Trouble (re-enable DVFS)

@rbonghi

My last situation before running jtop --restore was that, although I had set temp_control to 1 and target_pwm to 0, the fan went to 40% after every reboot of the Xavier and stayed there. (40% was a value I had set at some point through manual control in jtop.)
After running jtop --restore the situation normalised: the fan no longer starts after a reboot, and it starts when temperatures get higher.

This led me to conclude jtop does some kind of overriding of the default OS fan control, but I could be completely wrong of course.
Does that help in any way to investigate what is happening?

On a side note: is there any condition under which jtop could get into a loop and permanently push CPU1 to 100%? This is what led me to start experimenting with fan control options. (I can’t reproduce it right now, however.)

jtop overrides the fan control only when you switch to manual control (using ‘f’).
This is the service that changes the speed at boot: https://github.com/rbonghi/jetson_stats/blob/master/scripts/jetson_fan.sh. Otherwise jtop doesn’t change anything.

Honestly, I never noticed that. I tried on different NVIDIA Jetson boards and didn’t see anything like that, but I will take an extra check during the next few days.

There is only one such condition, but it requires changing the default configuration of jtop, using: jtop -r 100 (or another really low number). In this case jtop runs tegrastats at a higher sampling rate (usually tegrastats works at 500ms), and you can see an overload of one CPU.

Anyway, does your board work fine now?
Can you test this configuration for me:

  • jetson_clocks
  • /sys/devices/pwm-fan/target_pwm=0
  • /sys/devices/pwm-fan/temp_control=1
  • Apply a load
  • Check if the fan starts automatically at high temperature

I will do the same tonight.

    Best,
    Raffaello
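
The sysfs part of the sequence above can be scripted. This is only a minimal sketch in Python, assuming the pwm-fan directory is at /sys/devices/pwm-fan as in the list and that the script runs as root after jetson_clocks. The function name set_auto_fan is made up for illustration and is not part of jtop.

```python
from pathlib import Path

# Assumed sysfs location, taken from the paths quoted in the list above.
PWM_FAN_DIR = Path("/sys/devices/pwm-fan")


def set_auto_fan(fan_dir: Path = PWM_FAN_DIR) -> dict:
    """Write target_pwm=0 and temp_control=1, then read both values back."""
    settings = {"target_pwm": "0", "temp_control": "1"}
    for name, value in settings.items():
        (fan_dir / name).write_text(value + "\n")
    # Reading the files back shows whether the kernel accepted the values.
    return {name: (fan_dir / name).read_text().strip() for name in settings}
```

On the board this would be called as set_auto_fan() from a root Python session; the returned dictionary shows what the driver actually reports after the write.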

    @rbonghi When you say ‘jetson_clocks up’, do you mean just running jetson_clocks? Because jetson_clocks does not accept a parameter ‘up’.

    yes, right!

    Only run jetson_clocks. I fixed the previous message.

    @rbonghi I have done the sequence you asked for and put some load on the Xavier, and the fan turns on automatically when avg. temperatures reach 50C.
    I have also given the Xavier the time to cool down and I can confirm the fan switching off when avg. temperatures drop beneath 32C.
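
The on/off behaviour described above looks like a two-threshold (hysteresis) rule. A rough sketch, assuming the 50C and 32C values observed in this test; these are not values read from the thermal driver, which may use a finer table of trip points and PWM steps. The function name fan_should_run is hypothetical.

```python
def fan_should_run(avg_temp_c: float, running: bool,
                   on_at: float = 50.0, off_below: float = 32.0) -> bool:
    """Two-threshold decision using the temperatures observed in the test."""
    if avg_temp_c >= on_at:
        return True
    if avg_temp_c < off_below:
        return False
    # Between the two thresholds the fan keeps whatever state it had.
    return running
```

Under this rule a board sitting at 45C keeps the fan off if it was off before, and keeps it on if it was already running.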

    It seems in my case I’m back to a normal situation, and that it was solved by running jtop --restore.

    Any other scenarios I can test for you?

    Cool thank you!

    I will try again and I think I will re-enable this feature in jtop!

    Thank you a lot! :-)

    You are very welcome.

    I’m happy my board is back to a normal state, but I’m still curious where I went wrong to get these strange behaviours I noticed before.
    Any ideas?

    If a CPU goes to 100% again, open htop and find which process is using this resource.

    I don’t have other ideas to manage this issue. :-(

    I’m wondering if perhaps nvpmodel changes allowed ranges such that you are seeing this in the “/sys” files…don’t know. If there is a new minimum for certain values under different models, then I might expect the fan could be part of that (purely speculation).

    A valuable suggestion, but I always had it on nvpmodel 0.

    rbonghi,

    On which platform did you hit this issue, Nano or Xavier? Do you remember any application that was running on the device, or was it just idle?

    I remember I used a Xavier, but I tried it some time ago (around September).

    When I run this type of test, the board is usually idle.

    Again this is getting stranger by the day.

    In the last few days I again had 2 occasions where CPU1 usage went up to 100% and the Xavier heated up without the fan starting, and then shut down with a panic.
    The last time nothing was running (the board was idle after a reboot, with just an ssh session connected).

    After the last panic shutdown and reboot, CPU1 went again to 100% but then gradually started to go down to 0%, without any processes in htop showing abnormal CPU usage.

    I now have only jtop and htop running. I hardly see any load on the CPUs (all < 5%) or GPU.
    Nevertheless the temperature is slowly going up.
    After about 30 minutes CPU1 slowly starts ramping up (going from 0% to 100% in around 10 minutes), with no processes in htop showing additional CPU usage. (see attached jtop output)
    And again a panic reboot :-(

    Any ideas what I can do to find out what is going on?
    My last resort is completely reflashing the Xavier…

    NVIDIA Jetson AGX Xavier - Jetpack 4.3 [L4T 32.3.1]
    CPU1 [|||||||||||||||||||||||||||||||||||||||||||||Schedutil -  92%] 2.3GHz CPU5 [|                                            Schedutil -   3%] 2.3GHz
    CPU2 [                                             Schedutil -   1%] 2.3GHz CPU6 [                                             Schedutil -   0%] 2.3GHz
    CPU3 [                                             Schedutil -   0%] 2.3GHz CPU7 [                                             Schedutil -   1%] 2.3GHz
    CPU4 [|                                            Schedutil -   3%] 2.3GHz CPU8 [|                                            Schedutil -   3%] 2.3GHz
    
    MTS FG [                                                                  0%] BG [                                                                  0%]
    Mem [||                                                                                                                      0.7G/31.9GB] (lfb 7652x4MB)
    Swp [                                                                                                                         0.0GB/16.0GB] (cached 0MB)
    EMC [                                                                                                                                         0%] 2.1GHz
    
    GPU [                                                                                                                                         0%] 318MHz
    Dsk [########################################################                                                                             10.9GB/27.4GB]
                          [info]                       [Sensor]   [Temp]                                  [Power/mW]   [Cur]   [Avr]
    UpT: 0 days 0:48:27                                AO         46.50C                                  CPU          1397    599
    FAN [                                  0%] Ta=  0% AUX        46.50C                                  CV           0       0
    Jetson Clocks: inactive                            CPU        48.50C                                  GPU          0       0
    NV Power[0]: MAXN                                  GPU        48.50C                                  SOC          2638    2509
    APE: 150MHz                                        PMIC      100.00C                                  SYS5V        3330    3304
    HW engine:                                         Tboard     47.00C                                  VDDRQ        775     775
     ENC: NOT RUNNING                                  Tdiode     49.25C                                  Total        8140    7187
     DEC: NOT RUNNING                                  thermal    47.70C
    
    NVIDIA Jetson AGX Xavier - Jetpack 4.3 [L4T 32.3.1]
    CPU1 [|||||||||||||||||||||||||||||||||||||||||||||Schedutil - 100%] 2.3GHz CPU5 [|                                            Schedutil -   2%] 2.3GHz
    CPU2 [                                             Schedutil -   0%] 2.3GHz CPU6 [                                             Schedutil -   0%] 2.3GHz
    CPU3 [                                             Schedutil -   1%] 2.3GHz CPU7 [                                             Schedutil -   1%] 1.9GHz
    CPU4 [                                             Schedutil -   1%] 2.1GHz CPU8 [|                                            Schedutil -   2%] 2.3GHz
    
    MTS FG [                                                                  0%] BG [                                                                  0%]
    Mem [||                                                                                                                      0.8G/31.9GB] (lfb 7651x4MB)
    Swp [                                                                                                                         0.0GB/16.0GB] (cached 0MB)
    EMC [                                                                                                                                         0%] 2.1GHz
    
    GPU [                                                                                                                                         0%] 318MHz
    Dsk [########################################################                                                                             10.9GB/27.4GB]
                          [info]                       [Sensor]   [Temp]                                  [Power/mW]   [Cur]   [Avr]
    UpT: 0 days 0:49:54                                AO         47.00C                                  CPU          1552    623
    FAN [                                  0%] Ta=  0% AUX        46.50C                                  CV           0       0
    Jetson Clocks: inactive                            CPU        49.00C                                  GPU          0       0
    NV Power[0]: MAXN                                  GPU        49.00C                                  SOC          2638    2513
    APE: 150MHz                                        PMIC      100.00C                                  SYS5V        3370    3305
    HW engine:                                         Tboard     47.00C                                  VDDRQ        775     775
     ENC: NOT RUNNING                                  Tdiode     49.75C                                  Total        8335    7216
     DEC: NOT RUNNING                                  thermal    47.70C
    

    And after the panic reboot, now the fan turns on because after the reboot avg. temperatures were still above 50C.
    After cooling down to under 32C, the whole cycle repeats again (heating up → CPU1 100% → panic)

    Hi Herman,

    Try using htop rather than jtop to find which process is using your CPU.

    htop can sort all processes, so you can easily find where all the resources are used.

    Hi Raffaello,

    thanks again for your help.

    Maybe you missed in my post (the one with the jtop screen dumps) that I do also use htop to monitor whether any processes are misbehaving.
    Do I have to use any htop command line options to see more details?

    I have tried again, now also with the K option enabled in htop, and I get the same behaviour: temperature going up, CPU1 going to 100% in jtop and in htop, and nothing special to see in any of the processes in htop.
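
When no process accounts for the load, one thing that could be tried is a per-core breakdown from /proc/stat, which splits time into user, system, irq, softirq and so on. The sketch below is only an illustration of that idea, not something confirmed in this thread. It assumes the standard Linux /proc/stat layout, and CPU1 in jtop probably corresponds to cpu0 here (an assumption). The function names are made up.

```python
import time

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text: str) -> dict:
    """Return {cpu_name: {field: jiffies}} for the per-core lines of /proc/stat."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-core lines look like "cpu0 ..."; the bare "cpu" line is the total.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            cores[parts[0]] = dict(zip(FIELDS, values))
    return cores


def breakdown(before: dict, after: dict) -> dict:
    """Percentage of the elapsed time each core spent in each state."""
    result = {}
    for cpu, now in after.items():
        prev = before.get(cpu)
        if prev is None:
            continue
        delta = {f: now[f] - prev[f] for f in now}
        total = sum(delta.values())
        if total > 0:
            result[cpu] = {f: round(100.0 * d / total, 1) for f, d in delta.items()}
    return result


def sample(interval_s: float = 1.0, path: str = "/proc/stat") -> dict:
    """Take two snapshots of /proc/stat and return the per-core breakdown."""
    with open(path) as fh:
        first = parse_proc_stat(fh.read())
    time.sleep(interval_s)
    with open(path) as fh:
        second = parse_proc_stat(fh.read())
    return breakdown(first, second)
```

Calling sample() while CPU1 is busy would show where that core's time goes. A large irq, softirq or system share might point to kernel-side work rather than a user process.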

    The CPU1 usage starts going up at an avg. temp of around 45C, so before the fan can spin up.

    Is it possible that the clock speed of CPU1 is throttled down, so that what ran at 5% before now shows a relatively higher percentage?
    (jtop still shows CPU speed above 2.1GHz)
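
The throttling question could be cross-checked against what the kernel itself reports. A small sketch, assuming the standard Linux cpufreq sysfs layout (scaling_cur_freq in kHz under /sys/devices/system/cpu). Whether this matches the number jtop displays is not confirmed here, and the function name is hypothetical.

```python
from pathlib import Path

# Standard Linux location of the per-core cpufreq files.
CPU_ROOT = Path("/sys/devices/system/cpu")


def current_freqs_mhz(cpu_root: Path = CPU_ROOT) -> dict:
    """Return {cpu_name: MHz} from each core's cpufreq/scaling_cur_freq (kHz)."""
    freqs = {}
    for freq_file in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        khz = int(freq_file.read_text().strip())
        # Path is .../cpuN/cpufreq/scaling_cur_freq, so two parents up is cpuN.
        freqs[freq_file.parent.parent.name] = khz / 1000.0
    return freqs
```

Polling current_freqs_mhz() while the CPU1 percentage climbs would show whether that core's clock is actually dropping.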

    Unless someone has an idea, I’m afraid I’m going to have to reflash because I need the Xavier to get some work done.

    BROKEN XAVIER???

    I did the following steps to get my Xavier back into its minimal state to use it again:

    • I reflashed the Xavier from scratch
    • mounted the installed NVMe drive as /home
    • sudo nvpmodel -m 0
    • sudo apt update
    • sudo apt upgrade
    • sudo apt install python3-pip
    • python3 -m pip install --upgrade pip
    • sudo -H pip install jetson-stats
    • sudo apt install curl
    • curl -sSL https://get.docker.com/ | sh

    And now I have the same problem of the Xavier heating up without any load, CPU1 going to 100% at a certain moment in time, and a panic crash.

    Should I consider this Xavier broken, and should it be replaced?

    These steps should not drive a CPU to 100%… I guess we have two issues here.

    For the fan issue, I ran a small test on my Xavier yesterday. It seems the combination of temp_control=1 + target_pwm=0 does not make the fan auto-start when the temperature goes up. I need to check this behavior with our thermal driver expert. For now, please do not use this setting.

    As for the unknown workload on CPU1, are you able to observe anything in jetson-stats or tegrastats? What is the temperature status when the CPU panics?

    Hi Wayne,

    Thanks for looking into this.
    I will wait a while before replacing my Xavier, until you have had the chance to talk to the thermal driver expert.
    To me it looks like the temp_control=1 + target_pwm=0 setting works sometimes, and sometimes it does not.

    The thing that really worries me is that I had this behaviour right after reflashing…
    I have now set the Xavier in ‘cool’ fan mode and I’m currently testing when I put the Xavier under load.
    The fan did turn on at 30% with temp_control=1 + target_pwm=0, and when avg. temp goes above 50C the fan spins up to 47%, and at 47% temperatures seem to stabilise under the current load. So this looks like normal behaviour under load.

    I will let it run like this for another hour. After this test I will let it cool down, reboot it, and not put any load on it, to see whether I still have this behaviour of gradually getting hotter and CPU1 going to 100% while doing nothing.
    I will try to log tegrastats to a file, and send it to you when I get a panic.
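
Since the board reboots abruptly, the log should be forced to disk after every sample. A minimal sketch of such a logger, assuming tegrastats is on the PATH and accepts --interval in milliseconds. The function names and the log path are made up, and universal_newlines is used so it also runs on the Python 3.6 that ships with this JetPack.

```python
import os
import subprocess
import time


def log_lines(lines, log_path: str) -> int:
    """Append each line with a timestamp and force it to disk immediately."""
    count = 0
    with open(log_path, "a") as log:
        for line in lines:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {line.rstrip()}\n")
            log.flush()
            os.fsync(log.fileno())  # so the last samples survive a sudden reboot
            count += 1
    return count


def log_tegrastats(log_path: str, interval_ms: int = 1000) -> int:
    """Run tegrastats and log its output until it exits or is interrupted."""
    cmd = ["tegrastats", "--interval", str(interval_ms)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          universal_newlines=True) as proc:
        return log_lines(proc.stdout, log_path)
```

For example, log_tegrastats("/home/user/tegrastats.log") (a hypothetical path) would keep appending until the panic, and the last lines of the file then show the state just before the reboot.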