GPU needs to be "warmed up" to achieve maximum performance

Hello,

JetPack 7.0
L4T: 38.2.2
jetson_release output:

We ran GPU performance tests on the Thor AGX devkit and noticed that the results are highly dependent on GPU power consumption.
When the tests are run for the first time, the performance is “poor”. Tegrastats shows VDD_GPU 64497mW/64347mW.
However, after the tests have been running for a while, GPU power consumption suddenly increases by about 10 W (VDD_GPU 64497mW/64347mW → VDD_GPU 74692mW/64693mW) and the performance becomes “good”.

Our tests don’t configure anything in the OS; they just run heavy-load operations on the GPU.
We tried running jetson_clocks, disabling DVFS, and all the other “knobs” described in the Jetson Thor Product Family — NVIDIA Jetson Linux Developer Guide, but it doesn’t help.

After digging deeper into the issue, we learned how to reproduce the increase in GPU power consumption and to compare performance before and after it.
We use two applications:

  • benchmark - a GPU performance testing application
  • warmupapp - a GPU warm-up application that continuously performs heavy-load operations on the GPU.

Steps to reproduce:

  • run benchmark and tegrastats. Watch VDD_GPU in the tegrastats output.
    The current/average values of VDD_GPU are almost the same (VDD_GPU 64497mW/64347mW).
    Benchmark finishes with poor results.
  • run warmupapp. Watch VDD_GPU in tegrastats. Wait 20-60 seconds until the VDD_GPU current value increases (VDD_GPU 64497mW/64347mW → VDD_GPU 74692mW/64693mW).
  • stop warmupapp and run benchmark.
    Benchmark finishes with good results.
  • wait about 5 minutes.
  • run benchmark. Watch VDD_GPU in tegrastats.
    The current/average values of VDD_GPU are almost the same again (VDD_GPU 64497mW/64347mW). There is no increase in GPU power consumption.
    Benchmark finishes with poor results.
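
The power jump in the steps above can be spotted automatically from the tegrastats output. A minimal sketch in Python (the regex and the 5 W threshold are assumptions based on the values quoted above, not anything NVIDIA ships):

```python
import re

def parse_vdd_gpu(line):
    """Extract (current_mW, average_mW) from a tegrastats line, or None."""
    m = re.search(r"VDD_GPU (\d+)mW/(\d+)mW", line)
    return (int(m.group(1)), int(m.group(2))) if m else None

def detect_jump(lines, threshold_mw=5000):
    """Return the index of the first sample whose current draw exceeds
    the first sample by more than threshold_mw (the ~10 W step)."""
    samples = [p for p in map(parse_vdd_gpu, lines) if p]
    if not samples:
        return None
    base = samples[0][0]
    for i, (cur, _avg) in enumerate(samples):
        if cur - base > threshold_mw:
            return i
    return None

log = [
    "... VDD_GPU 64497mW/64347mW ...",
    "... VDD_GPU 64512mW/64360mW ...",
    "... VDD_GPU 74692mW/64693mW ...",
]
print(detect_jump(log))  # -> 2
```

Piping `tegrastats` into a script like this makes it easy to timestamp exactly when the device flips into the “warmed up” state.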

So GPU performance increases after warm-up, and decreases again after some time if no operations are performed on the GPU.

How can we make the GPU run at maximum performance without warming it up?

If you are not aware of this issue, we are ready to provide more details.


Are you aware of the “nvpmodel” program? Jetsons can be set to run in a number of “power modes”. Some are for conserving power, and the “MAXN” mode forces everything to maximum (including heat and power consumption). Check out “nvpmodel --help”.

When the power mode is limiting power consumption, it can reduce clock speeds on purpose. However, you also have to consider the time it takes to fill RAM and perhaps the CPU cache. Even for a purely CPU-bound program, any core with a cache will run slower on a cache miss than on a hit; almost by definition, the first run must be a cache miss.

I’m not sure how cache might affect the GPU, but if the CPU is involved in filling RAM for the GPU, then cache would matter. Even without cache, directly filling RAM (the Jetson GPU is an integrated iGPU, while most desktop PCs use a discrete dGPU) takes time. Once RAM is set up, the real work can start.
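
The first-touch cost described above is easy to demonstrate on the CPU side. A small sketch, assuming a demand-paged anonymous mapping (this illustrates page-fault/cache warm-up in general, not the Jetson GPU path specifically):

```python
import mmap
import time

SIZE = 64 * 1024 * 1024          # 64 MiB anonymous mapping
buf = mmap.mmap(-1, SIZE)        # pages are typically not faulted in yet

def touch(b, step=4096):
    """Read one byte per page, forcing any unfaulted pages to be mapped in."""
    total = 0
    for off in range(0, len(b), step):
        total += b[off]
    return total

t0 = time.perf_counter(); touch(buf); cold = time.perf_counter() - t0
t1 = time.perf_counter(); touch(buf); warm = time.perf_counter() - t1
print(f"first pass {cold:.4f}s, second pass {warm:.4f}s")
```

On most systems the first pass is measurably slower because it takes all of the page faults; the second pass hits already-resident pages. That kind of one-time cost is the effect being described here, though it would not explain a warm-up lasting minutes.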

I suspect that what you are seeing is normal behavior; for this to be something different, the hardware would have to be realtime hardware without buffers or cache. I’m guessing a system such as a desktop PC is much faster at filling that RAM and cache, and so the effect might not be as noticeable on the first run.

Of course, the MAXN NV Power mode was set during the tests (see the jetson_release output screenshot above). We also tried the 120W mode.

Maybe you could explain why GPU power consumption increases sharply (by about 10 W) 20-60 seconds after warmupapp is started? And why GPU performance then also increases, despite the clock throttling?

The main point is that GPU performance is 100% dependent on GPU power consumption.
If GPU power consumption doesn’t increase, performance never increases. Benchmark results will always be poor, regardless of how often and how many times the benchmark runs.


It isn’t warming up, it is distributing data. The CPU uses power to charge a “1”, and creates heat when it shorts that “1” to become a “0”. Continuous operation with a higher average rate of flipping between “1” and “0” is what you see as power consumption. It does take power to load data into RAM and into cache, but it is trivial compared to computing with that data. It isn’t until you get cache hits and operations that flip “1” and “0” back and forth that you get either the perception of “progress” or greater CPU and GPU power use.

If this were hard realtime hardware, then in order to get deterministic behavior you would run without cache. Total throughput would go down, but latency would then be guaranteed, and you would see the hardware reach full power very quickly: there would be no cache fills and no cache misses (with their accompanying fills). The GPU cannot perform until its program is distributed and its data is ready.

This isn’t to say that what you’re seeing couldn’t use optimizing. There is every chance that, as NVIDIA improves the software, work with the GPU (and CPU) can begin faster, but unless it is designed for hard realtime, it will never be without a startup lag.

Also, if you have two or more processes using a CPU core, or perhaps the GPU, and there is some sort of cache, then one process invalidating the other process’s cache adds to this. You would also have to be certain that data from unrelated processes does not interfere. This is harder than it sounds because, even though you don’t see them, there are disk drivers, network drivers, filesystem drivers, and so on. I’m not sure yet whether Thor is the same as previous Jetson generations, but on those, most hardware interrupts go only to CPU core 0.

Software interrupts can run on any core, and are mostly on timers that tell the scheduler they need attention. Hardware interrupts must have a physical wire for their IRQ to trigger on a given core. If that wire goes only to CPU 0, then your hardware IRQs will always compete on that core, which in turn causes more cache invalidations/misses. On a desktop PC there is a mechanism to program which core handles a given device’s hardware IRQ: on Intel it is the I/O Advanced Programmable Interrupt Controller (I/O APIC), and AMD has its own way of doing something similar. Unless Thor has changed, most hardware IRQs route only to CPU 0.
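
Whether Thor concentrates hardware IRQs on CPU 0 can be checked from /proc/interrupts. A sketch that tallies the counts per CPU column (the sample text is embedded so it runs anywhere; on the device you would read the real file):

```python
def irq_cpu_totals(text):
    """Sum interrupt counts per CPU column from /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        fields = line.split()
        # fields[0] is the IRQ label ("29:"), then one count per CPU
        for i, f in enumerate(fields[1:1 + ncpu]):
            if f.isdigit():
                totals[i] += int(f)
    return totals

# Hypothetical two-core excerpt; on a Jetson use open("/proc/interrupts").read()
SAMPLE = """\
            CPU0       CPU1
  29:     123456         12   GICv3  30 Level   arch_timer
  45:      98765          0   GICv3 122 Level   serial
"""
print(irq_cpu_totals(SAMPLE))  # -> [222221, 12]
```

A heavily skewed first column would support the CPU 0 theory; a flat distribution would rule it out for this workload.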

I’m not sure of the internal structure of the GPU, but it does change and improve with each generation. There must be something similar in the GPU for its computations to work in programmable blocks, but eventually it has to talk to either the memory controller or the CPU.

A hardware IRQ can be told to route to an unsupported CPU core, but when it is time to run, the scheduler will reschedule it on a core which can handle it. I really wish I knew more about how Jetsons handle hardware interrupts now versus in older systems. Some parts, such as the AON complex for GPIO, can be reassigned, but only in blocks; it isn’t as fine-grained as an I/O APIC.

To really know what goes on, you’d likely have to profile your application and find out which calls are taking time during this initial startup versus once it has been running (what you’ve called “warmup”). If this shows time spent in certain operations to the GPU, then it means you can reproduce it, and that in turn means NVIDIA could look at it and see whether optimization is available. Perhaps it is just distributing data, in which case the drivers have no optimization for this, but it would also imply that your own data layout might be changed to get a faster start.

You might want to create a way of reproducing this and make it available to NVIDIA, and perhaps profiling it and finding out where most time is spent.


You might want to create a way of reproducing this and make it available to NVIDIA, and perhaps profiling it and finding out where most time is spent.

For now, we would like to know whether NVIDIA is aware of this issue.
Maybe it has already been solved in JP7.1, and we just need to wait for the release instead of spending time researching the issue.
If NVIDIA is not aware of this issue, we are ready to continue investigating and will consider providing a reproducible example.

Could an NVIDIA representative confirm whether they are aware of this issue or not?

I’m not sure which person looks at optimizations, but I would bet @WayneWWW could pass it along. Still, it would be nice if you could provide some sample code for them to reproduce this with. Or is this entirely from your benchmark app? If so, then maybe some benchmark setup information.

Hi,
Do you execute $ sudo jetson_clocks before running the tests? This command fixes the GPU engine at its maximum clock.

Hi,
Yes, we executed this command, but it did not help.

Hi,

Updating to JP7.1 did not help either. I will come back soon with more details and a sample.

Hi,
We suggest a 1-second warm-up while doing profiling:

VPI - Vision Programming Interface: Performance Benchmark
3. One second warm-up time running the algorithm in a loop.

Do you see a longer warm-up time in your use case?

Hi,
Yes, we see much longer warm up time (minutes, not seconds).

Please find below the zip archive with:

  • max_perf.sh - runs jetson_clocks and other “knobs” to achieve max performance
  • thor_warmup.cu - source code to warm Thor up.
  • thor_warmup_tool - prebuilt tool from thor_warmup.cu

You could run the prebuilt thor_warmup_tool or build it by yourself from thor_warmup.cu.

Steps to reproduce:

  • run: $ sudo ./max_perf.sh
  • run: $ sudo tegrastats
  • in a new terminal run thor_warmup_tool: $ sudo ./thor_warmup_tool
  • Wait until [WARMED UP!!!]message appears. It usually takes ~1-10 minutes:
    Time: 134 s. Bandwidth: 237.312817 GB/s [NOT WARMED UP…]
    Time: 134 s. Bandwidth: 237.348936 GB/s [NOT WARMED UP…]
    Time: 134 s. Bandwidth: 237.333123 GB/s [NOT WARMED UP…]
    Time: 134 s. Bandwidth: 237.331169 GB/s [NOT WARMED UP…]
    Time: 134 s. Bandwidth: 237.286420 GB/s [NOT WARMED UP…]
    Time: 134 s. Bandwidth: 237.322987 GB/s [NOT WARMED UP…]
    Time: 135 s. Bandwidth: 252.359319 GB/s [NOT WARMED UP…]
    Time: 135 s. Bandwidth: 260.395361 GB/s [WARMED UP!!!]
    Time: 135 s. Bandwidth: 261.489439 GB/s [WARMED UP!!!]
    Time: 135 s. Bandwidth: 261.530598 GB/s [WARMED UP!!!]
    Time: 135 s. Bandwidth: 261.509800 GB/s [WARMED UP!!!]
    Time: 135 s. Bandwidth: 260.626844 GB/s [WARMED UP!!!]
    Time: 135 s. Bandwidth: 261.434510 GB/s [WARMED UP!!!]
    Time: 135 s. Bandwidth: 261.485083 GB/s [WARMED UP!!!]
  • You will see increase in bandwidth to ~260 GB/s. You will also see increase in VDD_GPU power consumption and EMC_FREQ utilization in tegrastats.
  • Press CTRL+C to stop thor_warmup_tool and run it again as fast as possible
  • warmup process will take ~5 seconds.

nvidia.zip (294.4 KB)

Hi,
Thanks. We will test on AGX Thor developer kit.

1 Like

Hi,
We can observe the issue by running your sample application on developer kit. Is checking with our teams and will update.

1 Like

Hi,
How is it going? Do you have any workarounds to try?

Hi,
We are still checking it. Will update once there is further finding. Thanks,

Hi,
JFYI, We have noticed that the memory bandwidth depends on the GPU temperature.
When the GPU temperature reaches 50 degrees, the memory bandwidth increases to its maximum value.
We modified /etc/nvfancontrol.conf to use passive cooling up to 50 degrees and put device in warm place.
The warm-up time has been significantly reduced.

We are looking forward to receiving the fixes.

I would tend to think that actually using more bandwidth causes a higher temperature, and not the other way around. If it truly does increase speed as a result of heating up, then I would think the power mode software has a bug.

I’m having the same issue. In my setup I’m getting the following stats (idle, slow, fast modes):
03-10-2026 15:27:51 RAM 19854/125772MB (lfb 522x4MB) CPU [0%@2601,0%@2601,3%@2601,0%@2601,0%@2601,0%@2601,0%@2601,0%@2601,0%@2601,0%@2601,0%@2601,0%@2601,0%@2601,0%@2601] EMC_FREQ 3%@4266 GR3D_FREQ @[1574,1574,1574] NVENC0_FREQ @1691 NVENC1_FREQ @1691 NVDEC0_FREQ @1691 NVDEC1_FREQ @1691 NVJPG0_FREQ @1691 VIC off OFA_FREQ @1691 PVA0_FREQ off APE 300 cpu@51.062C tj@52.781C soc012@51.375C gpu@52.781C soc345@50.531C VDD_GPU 4763mW/4763mW VDD_CPU_SOC_MSS 7939mW/8204mW VIN_SYS_5V0 5902mW/5936mW VIN 25916mW/25730mW
03-10-2026 15:47:17 RAM 19834/125772MB (lfb 523x4MB) CPU [1%@2601,3%@2601,14%@2601,1%@2601,3%@2601,5%@2601,3%@2601,1%@2601,1%@2601,14%@2601,9%@2601,5%@2601,3%@2601,3%@2601] EMC_FREQ 45%@4266 GR3D_FREQ @[1574,1573,1571] NVENC0_FREQ @1691 NVENC1_FREQ @1691 NVDEC0_FREQ @1691 NVDEC1_FREQ @1691 NVJPG0_FREQ @1691 VIC off OFA_FREQ @1691 PVA0_FREQ off APE 300 cpu@51.5C tj@54.531C soc012@50.968C gpu@54.531C soc345@52.937C VDD_GPU 32921mW/23813mW VDD_CPU_SOC_MSS 15078mW/12681mW VIN_SYS_5V0 14175mW/11530mW VIN 87710mW/58422mW
03-10-2026 15:49:59 RAM 19860/125772MB (lfb 523x4MB) CPU [1%@2601,3%@2601,0%@2601,1%@2601,1%@2601,1%@2601,1%@2601,1%@2601,1%@2601,1%@2601,0%@2601,3%@2601,3%@2601,17%@2601] EMC_FREQ 62%@4266 GR3D_FREQ @[1574,1571,1571] NVENC0_FREQ @1691 NVENC1_FREQ @1691 NVDEC0_FREQ @1691 NVDEC1_FREQ @1691 NVJPG0_FREQ @1691 VIC off OFA_FREQ @1691 PVA0_FREQ off APE 300 cpu@53.156C tj@56.25C soc012@51.781C gpu@56.25C soc345@54.968C VDD_GPU 42043mW/41779mW VDD_CPU_SOC_MSS 17452mW/17452mW VIN_SYS_5V0 15691mW/15691mW VIN 82168mW/78593mW

The warmup time depends, but usually takes 1-3 minutes on the cold run, and about 5 seconds after it was recently (< 1min ago) warmed up. Also, sometimes it keeps flipping from fast to slow and back every 5 seconds or so. But the change is always descrete (binary), it never switches into “medium” performance mode. Only fast or slow.

Another interesting observation. I tweaked the fan to start cooling when temp > 60℃ but with 90% rpm, so a sharp jump. I left the Thor at the evening with this setting and when I came back in the morning, I checked the temp and it was about 58℃ and, most importantly, the Thor responded in a fast mode from the very first request. No warm up was necessary.

This likely excludes any kind of “cache” optimizations hypothesys discussed earlier in the thread as the Thor had zero requests during the night.

This is sounding more like a bug in power modes.