Jetson Nano GPU stuck at high frequency after deep learning training

I monitor the current state of my Jetson Nano with jtop.

It is important to note that jetson_clocks is disabled.
I boot the Jetson Nano into text mode and operate it without a monitor or keyboard (I access it via Wi-Fi), so the GPU has no display work at all.

When I train a common MNIST classification model (using the official Jetson Nano TensorFlow and PyTorch builds), the GPU frequency scales up to 921 MHz, as expected under load.

The strange behaviour occurs when the scripts finish. The GPU then stays at a higher frequency (e.g. 614 MHz, sometimes even the maximum of 921 MHz), even though there is no longer any load on it.

The GPU does not scale down even after a long time; at one point I let the Jetson Nano idle for 2 hours, to no effect. The GPU was still stuck at the same high frequency.

The only way I have found so far to force the GPU frequency back down to 76 MHz is to disable and immediately re-enable railgating:

/sys/devices/57000000.gpu# echo 0 > railgate_enable
/sys/devices/57000000.gpu# echo 1 > railgate_enable

The GPU immediately drops to 76 MHz when I re-enable railgating.
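
To verify, reading cur_freq from the devfreq node (same device path as above) should then show the 76 MHz minimum:

$ cat /sys/devices/57000000.gpu/devfreq/57000000.gpu/cur_freq
76800000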

Is something wrong with my device, or is this expected behaviour? Is there some system parameter that I can set so that the GPU goes back to its lowest frequency when it has nothing to do?

Hi,

Just to clarify: when the training job finishes, is the deep learning framework application also closed?
Would you mind monitoring the system with tegrastats to make sure that the GPU utilization is 0%?
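
For example (the interval is in milliseconds; running it as root shows more fields):

$ sudo tegrastats --interval 1000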

Thanks.

Hi! Thanks for replying!

Yes, it is a textbook vanilla deep learning Python script. When training finishes, I am back at the command line, and the DL framework should have exited completely.

I just reproduced it with a simple training run; afterwards, tegrastats shows this output:

RAM 686/3964MB (lfb 210x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [1%@102,3%@102,off,off] EMC_FREQ 6%@204 GR3D_FREQ 0%@537 APE 25 PLL@25C CPU@27.5C iwlwifi@33C PMIC@100C GPU@27C AO@35.5C thermal@27.25C POM_5V_IN 941/3336 POM_5V_GPU 0/907 POM_5V_CPU 128/670

So the GPU is definitely at 0% utilization (GR3D_FREQ 0%@537). (Edit: re-ran tegrastats as root for more information.)

jtop shows this:

GPU [ 0%] 537MHz

GPU is at 0% load, but stuck at 537 MHz.

After writing my first post I investigated further and made another interesting observation. When I look at the sysfs values of the GPU devfreq governor at /sys/devices/57000000.gpu/devfreq/57000000.gpu, I observe this:

$ cat cur_freq target_freq max_freq min_freq 
537600000
537600000
614400000
76800000

(I ran the test in 5W mode, nvpmodel mode 1.)

So it looks to me as if the GPU frequency governor is not scaling the GPU down but is somehow stuck at 537 MHz. Manually trying to write 76 MHz to target_freq as root only results in "permission denied", but that is expected for a read-only kernel value in sysfs.
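
For completeness, the same devfreq directory should also expose the active governor and the available frequency steps (these are standard devfreq sysfs attributes; I am assuming the usual layout here):

$ cat /sys/devices/57000000.gpu/devfreq/57000000.gpu/governor
$ cat /sys/devices/57000000.gpu/devfreq/57000000.gpu/available_frequencies  # assumed standard devfreq node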

Also, both jtop and tegrastats show the GPU consuming 0 W of power (POM_5V_GPU 0/907).

I have a theory about what might be happening: after training, the GPU has absolutely nothing to do (I operate the board headless in text mode, so there is no display work), so it powers down completely via railgating. The GPU frequency governor keeps the higher frequency a little longer (since the GPU just had several minutes of high load). When the governor finally wants to scale the GPU down, it encounters a powered-off GPU that does not react to frequency scaling until it is powered up again, so the governor stays at the last, higher frequency.

Well, I am no hardware expert, so this is just a wild guess. But does it sound reasonable? If so, could it be that the displayed 537 MHz is simply an artifact, and the situation is benign because the GPU is powered down anyway?

Hi,

Sorry for the late update.

We are trying to reproduce this issue in our environment.
Could you check whether it still occurs with our latest JetPack 4.4 release?

Thanks.

Hi,

We can reproduce this issue with JetPack 4.4 but not with JetPack 4.3.
We just want to confirm: are you also using JetPack 4.4?

Thanks.

Hi,

We found that increasing the railgate_delay time can be a fast workaround.
You can give it a try.

Thanks.

Hi, thanks a lot for investigating!

To your first question:

We can reproduce this issue with JetPack 4.4 but not with JetPack 4.3. We just want to confirm: are you also using JetPack 4.4?

That's interesting, because so far I am actually using JetPack 4.3. I plan to upgrade to JetPack 4.4 once the faster microSD card from my current Amazon order arrives; then I will immediately test whether the GPU still stays at a higher frequency.

So far, it looks like this issue is independent of the JetPack version.

For your second recommendation:

We found that increasing the railgate_delay time can be a fast workaround.

This indeed solves it. I observe that the GPU governor takes several seconds to scale the GPU frequency down, so when I increase this value to 10000 (which should correspond to a 10-second delay), the GPU is indeed scaled down to 76 MHz before it goes into railgate.
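
For reference, this is how I set it (as root; I am assuming railgate_delay sits next to railgate_enable in the same sysfs directory, and the value does not persist across reboots):

/sys/devices/57000000.gpu# echo 10000 > railgate_delay  # assumed path, alongside railgate_enable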

The only remaining issue: when I observe the status of the Jetson Nano with jtop in a parallel terminal, the GPU does not go back into railgate at all and lingers powered on with around 40 mW of power consumption.

I suspect that tegrastats (which jtop calls in the background) keeps waking the GPU before it gets a chance to power down via railgating. The GPU only goes into railgate when I set the polling interval of jtop / tegrastats to something longer than railgate_delay. With a delay of 10 seconds, this is unfortunately still a bit suboptimal.
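
For example, polling less often than the 10-second railgate_delay lets the GPU power down again (interval in milliseconds):

$ sudo tegrastats --interval 15000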

But generally this issue can be considered resolved. The only remaining question worth a deeper look is what tegrastats does that keeps the GPU powered up whenever railgate_delay is longer than the tegrastats polling interval.