Reacting to GPU throttling events in Orin NX

padames · July 31, 2025, 10:52pm

I am porting a small C++ application service that warns the user in a web GUI about throttling events occurring in the Edge device.
In TX2 NX the throttling events could be detected by reading the file:
/sys/class/thermal/thermal_zone2/cdev6/userspace_alert/thermal_alert
The application had no difficulty opening the file because its permissions were readable by everyone.

Also, I could simulate a throttling event by artificially lowering the trip point below the current measured GPU temperature. Below is the test sequence:

The monitored temperature is (valid for TX2 units):
$ cat "/sys/class/thermal/thermal_zone2/temp"
55000

Verify the current trip point for GPU overheating:

$ cat "/sys/class/thermal/thermal_zone2/trip_point_6_temp"
95500

Modify it to trigger an alert:

$ echo "50000" > "/sys/class/thermal/thermal_zone2/trip_point_6_temp"
$ cat "/sys/class/thermal/thermal_zone2/trip_point_6_temp"
50000

Followed by a return to the original value to restore the system settings and end the test:

$ echo "95500" > "/sys/class/thermal/thermal_zone2/trip_point_6_temp"
$ cat "/sys/class/thermal/thermal_zone2/trip_point_6_temp"
95500

The alert would show in the web UI. Also, two events would be reported: the time when GPU overheating starts and the time when it ends.

I can reuse the same business logic in our Orin NX system but I am having difficulty finding the file that reports the thermal_alert.

When I run the script thermal_throttle_alert.py:

$ python3 /usr/share/nvpmodel_indicator/thermal_throttle_alert.py 
throttle-alert: [] Alert devices: []
Waiting for thermal throttle event..
hot-surface: [] Alert devices: []
Waiting for thermal throttle event..

and lower the GPU trip temperature with:

$ cat /sys/class/thermal/thermal_zone1/trip_point_1_temp
99000

$ echo "50000" | sudo tee  /sys/class/thermal/thermal_zone1/trip_point_1_temp
[sudo] password for intelliview: 
50000

$ cat /sys/class/thermal/thermal_zone1/trip_point_1_temp
50000

No throttling happens from the point of view of the script, which I have running on another terminal.
I can also do:
cat /sys/kernel/debug/bpmp/debug/soctherm/gpu_throt_status
And I get a 0.

How can I modify the trip point for GPU throttling for testing, and how can I check for a GPU throttle event in an Orin NX?

carolyuu · July 31, 2025, 11:01pm

*** Please note that this reply is generated by LLM automatically ***

To address your question, I’ll provide a comprehensive answer that compares the thermal management mechanisms between the TX2 NX and Orin NX platforms.

Thermal Management on TX2 NX

On the TX2 NX platform, you were able to detect throttling events by reading a file that reported the thermal status. You could also simulate a throttling event by artificially lowering the trip point below the current measured GPU temperature. The monitored temperature was valid for TX2 units, and you could modify the trip point to trigger an alert.

Thermal Management on Orin NX

On the Orin NX platform, you’re having difficulty finding the file that reports the thermal status. When you run the script and lower the GPU trip temperature, no throttling happens from the point of view of the script. You also tried running cat /sys/class/thermal/thermal_zone0/trip_points and got a 0.

The issue here is that the Orin NX platform uses a different thermal management mechanism than the TX2 NX. The Orin NX platform uses the BPMP Thermal Framework to manage thermal zones, power states, and sensing. This framework provides a more comprehensive and flexible way to manage thermal events.

Changing the Trip Point for GPU Throttling on Orin NX

To change the trip point for GPU throttling on the Orin NX platform, you’ll need to use the BPMP Thermal Framework APIs. Specifically, you’ll need to use the thermal_zone API to modify the trip point for the GPU thermal zone.

Here’s an example of how you can modify the trip point using the thermal_zone API:

echo 80 > /sys/class/thermal/thermal_zone0/trip_points

This command sets the trip point for the GPU thermal zone to 80°C. You can adjust this value to trigger an alert at a different temperature.

Reusing Business Logic on Orin NX

While you can reuse the same business logic for detecting throttling events on the Orin NX platform, you’ll need to modify the logic to use the BPMP Thermal Framework APIs. You can use the thermal_zone API to read the thermal status and modify the trip point to trigger an alert.

Comparison of Thermal Management Mechanisms

Here’s a comparison of the thermal management mechanisms between the TX2 NX and Orin NX platforms:

                              TX2 NX                               Orin NX
Thermal Management Mechanism  File-based thermal status reporting  BPMP Thermal Framework
Trip Point Modification       File-based modification              thermal_zone API
Thermal Zone Management       Not applicable                       BPMP Thermal Framework

In summary, the Orin NX platform uses a different thermal management mechanism than the TX2 NX platform. To change the trip point for GPU throttling on the Orin NX platform, you’ll need to use the BPMP Thermal Framework APIs. You can reuse the same business logic for detecting throttling events, but you’ll need to modify the logic to use the BPMP Thermal Framework APIs.

padames · July 31, 2025, 11:15pm

Thanks. According to this answer, the temperature trip point for a control action on the GPU clock frequency would be on thermal_zone0. I know this is not accurate:

$ cat /sys/class/thermal/thermal_zone0/type
CPU-therm

Zone 1 is the correct one, as I explained in the question. Moving on, the example also has the wrong temperature value; it must be in millidegrees Celsius, so multiply the 80 by 1000 (giving 80000) and it could be a valid example.
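
To avoid hard-coding a zone index, the zone can be looked up by the contents of its type file, and trip values can be converted to millidegrees before writing. This is only a minimal sketch: the helper names are hypothetical, and the type string to search for (for example "GPU-therm") is an assumption that should be checked with cat /sys/class/thermal/thermal_zone*/type on the actual unit.

```python
from pathlib import Path


def find_thermal_zones(type_name, base="/sys/class/thermal"):
    """Return thermal_zone* directories whose 'type' file equals type_name."""
    matches = []
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            if (zone / "type").read_text().strip() == type_name:
                matches.append(zone)
        except OSError:
            continue  # zone is unreadable or vanished; skip it
    return matches


def celsius_to_millidegrees(celsius):
    """Convert degrees Celsius to the millidegree units used by thermal sysfs."""
    return int(round(celsius * 1000))
```

With this, find_thermal_zones("CPU-therm") would return the thermal_zone0 directory on the unit shown above, and celsius_to_millidegrees(80) gives the 80000 that the example should have used.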

Could anyone answer what path/file to monitor for the occurrence of the throttling event?
Is my guess correct? The reason I ask is that I have read the documentation, and it is not clear to me where this can be done:

configuring-a-thermal-zone-using-the-device-tree

throttle-points-and-vector-configuration

By the way, we are on Tegra R35.4.1. Those pages are the closest I could get to my version.

The automated answer did not address this at all, or I missed it altogether. I used Warp AI agents at the beginning of this port; they were useful, but I need someone who can explain how to do this for real. Is that script even supposed to work on Orin NX? I ran some of the Python code at the command line to see the paths and those paths don’t exist in my Orin NX.

KevinFFF · August 1, 2025, 2:14am

Hi padames,

Are you using the devkit or a custom board for Orin NX?

Did you create this web UI to report throttling events?

Do you mean the OC event?
If so, you can simply run the following command to check whether there is any count.

# grep "" /sys/class/hwmon/hwmon*/oc*

For Jetpack 5.1.2(r35.4.1), please refer to Jetson Orin Nano Series, Jetson Orin NX Series and Jetson AGX Orin Series — Jetson Linux Developer Guide documentation for details.

padames · August 1, 2025, 5:33pm

Thanks for the reply @KevinFFF.

The browser UI is the front-end of a 24/7 industrial leak detection system based on dual cameras: thermal and visible light.

A user can view system events on a dedicated page. We publish the start and end times of any GPU throttling event there.
You asked whether I am looking to track overcurrent (OC) events. I am unsure whether slowing down the GPU clock qualifies as an OC event, just like turning a fan on certainly would. Can you confirm this?
As a reminder, in the TX2 there was a file where the GPU throttling state (active=1, inactive=0) could be read at any point in time. The file is /sys/class/thermal/thermal_zone2/cdev6/userspace_alert/thermal_alert. However, there is no equivalent in Orin NX.

The utility of knowing about GPU throttling becomes more apparent when other subsystems use it to adjust their responses for leak detection and system reliability assessments during those periods.

We are on JP 5.1.2 and L4T 35.4.1. The carrier board is a Connect Tech Inc. Photon NGX003 carrier board for Orin TX2 and Orin NX 16 GB.

Thanks for your suggestion to watch the monitor counts and throttling events while simulating GPU over-temperature by lowering the trip point for the passive thermal control strategy. Something I did not mention in the initial post: I followed the same testing steps described there while watching those counts. By that I mean lowering the threshold temperature for the passive cooling strategy with echo "50000" | sudo tee /sys/class/thermal/thermal_zone1/trip_point_1_temp when the GPU temperature is around 52000 mC. I have always seen zero counts for all monitors. What I mean by this is that these values don’t move:

root@orin-2:~# cat /sys/class/hwmon/hwmon1/oc*_event_cnt
0
0
0
root@orin-2:~# cat /sys/class/hwmon/hwmon1/oc*_throt_en
1
1
1

I see that NVIDIA has adopted a closer implementation of the ACPI thermal device interface in Orin NX, which sounds great: ACPI_spec_6_4_Thermal_control. However, I am not familiar with the kernel command line interface: linux_kernel_5_10_thermal_sysfs-api.

Do I have to use it to set and manipulate the trip points for the passive cooling strategy for the GPU?
Is that what this section is outlining: linux-thermal-framework?

Can you provide the location of this TEGRA234_THERMAL_ZONE_GPU thermal zone definition in my Tegra installation? If I have to create this file, where do I save it? If NVIDIA already provides it, what is the file called?

I think if I can change the trip point there under gpu-hot-surface from 0x11170 (=70000mC) to a temperature lower than the current GPU thermal average, I will be better able to see changes in the hardware monitors via: # cat /sys/class/hwmon/hwmon1/oc*_throt_en

I assume that I will be able to work out the start and end of the throttling event if I sample the values of the overcurrent files oc*_throt_en and oc*_event_cnt quickly enough. Assuming that this oc = GPU slowing down.
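
The sampling idea could be sketched as two snapshots of the counters that are then compared. This is a minimal sketch, not a confirmed detection method: the helper names are hypothetical, the glob pattern is taken from the commands above, and whether these counters move at all when the GPU clock is reduced is exactly the open question.

```python
import glob
from pathlib import Path


def read_oc_counters(pattern="/sys/class/hwmon/hwmon*/oc*_event_cnt"):
    """Return {path: count} for every over-current event counter found."""
    counters = {}
    for path in sorted(glob.glob(pattern)):
        try:
            counters[path] = int(Path(path).read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric file; skip it
    return counters


def changed_counters(before, after):
    """Return {path: increment} for counters that grew between two snapshots."""
    return {
        path: count - before.get(path, 0)
        for path, count in after.items()
        if count > before.get(path, 0)
    }
```

Taking one snapshot before lowering the trip point and one after, changed_counters(before, after) would be empty for the readings shown above, since all three counts stay at 0.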

Thanks,

Pablo

KevinFFF · August 5, 2025, 5:33am

I think you might not see an OC event if you decrease the GPU clock frequency.

It is defined in the bpmp source and used in the device tree as follows.

gpu-thermal {
    polling-delay = <1000>;
    polling-delay-passive = <1000>;
    thermal-sensors = <&bpmp_thermal TEGRA234_THERMAL_ZONE_GPU>;
    status = "okay";

    trips {
        gpu_sw_shutdown: gpu-sw-shutdown {
            temperature = <104500>;
            hysteresis = <0>;
            type = "critical";
        };

        gpu_sw_throttle: gpu-sw-throttle {
            temperature = <99000>;
            hysteresis = <0>;
            type = "passive";
        };
    };
    ..

Have you also tried using tegrastats to monitor its power/temperature status?

padames · August 5, 2025, 3:22pm

@KevinFFF, are these values modifiable? If so, how do I change them? Could you please show me how to access them?
I need to test the logic of detecting the start and stop of the GPU throttling event.

Regarding tegrastats, the program’s output appears to be already processed input from system file readings. Are you suggesting parsing the output of that program? What variables would indicate that a hardware throttling event is happening? The only two variables that are not RAM or temperature readings are: EMC_FREQ and GR3D_FREQ. Do you suggest that I track them at the same sampling rate tegrastats samples them? If that is the case, why not sample the original files where those frequencies get recorded? What would a non-zero frequency mean? What threshold frequency value in the valid range would indicate that the software throttle has started/stopped?

I see the answer to a similar question back in 2020, but for Jetson Xavier AGX for CPU software throttling: How to detect thermal throttling event on Tegra CPU

I have explained how I tracked the GPU throttle events (lowering the frequency of the GPU clock) programmatically and tested by lowering the temperature threshold (trigger), in a Jetson TX2 NX. I assumed at the time that these events included software and hardware throttling.

My question is: how can a GPU (software or hardware) throttle event be detected (start and stop) programmatically, using standard system files, on a custom board integrating Orin NX?

As I understand them, both software and hardware throttles are a reduction of the GPU clock frequency. The difference is the source of their trigger; in the software throttle, it comes from reading temperatures, and it is a corrective measure to prevent overheating due to an inertial heat imbalance (too high processing without power surges for too long a time, on a hot day out in the field?). In the hardware throttle, the trigger is an overcurrent event from batteries, regulators or inductors (in switches), and they may also lead to overheating.

I have reasoned that, regardless of the source, if I could monitor the actual GPU frequency, I could infer when throttling is happening, instead of monitoring temperatures and OC events.

I have been looking everywhere in the installation. Is this a good place to be monitoring? Are there more elegant ways of looking at this?

Attached is a text file in JSON format with the names of all the files and a sample of their values for my unit from the path:

/sys/devices/gpu.0/devfreq/17000000.ga10b

gpu_orin-2.json.txt (1.8 KB)

Please advise on the following two points (basically what to track and how to test it):

What values indicate that GPU hardware throttling is happening?
Is there a way to simulate a trigger for the frequency to be lowered below the accepted range, other than running a huge load on the GPU?

Pasting here the snapshot of the content of the files in the above path to facilitate the discussion:

[
    {
        "filename": "available_frequencies",
        "content": "306000000 408000000 510000000 612000000 714000000 816000000 918000000\n"
    },
    {
        "filename": "available_governors",
        "content": "nvhost_podgov wmark_active userspace performance simple_ondemand\n"
    },
    {
        "filename": "cur_freq",
        "content": "408000000\n"
    },
    {
        "filename": "governor",
        "content": "nvhost_podgov\n"
    },
    {
        "filename": "max_freq",
        "content": "612000000\n"
    },
    {
        "filename": "min_freq",
        "content": "306000000\n"
    },
    {
        "filename": "name",
        "content": "17000000.ga10b\n"
    },
    {
        "filename": "polling_interval",
        "content": "25\n"
    },
    {
        "filename": "target_freq",
        "content": "408000000\n"
    },
    {
        "filename": "timer",
        "content": "delayed\n"
    },
    {
        "filename": "trans_stat",
        "content": "     From  :   To\n           : 306000000 408000000 510000000 612000000 714000000 816000000 918000000   time(ms)\n  306000000:         0         2         0         1         0         0         0     19268\n* 408000000:         1         0       249         0         0         0         0   2329852\n  510000000:         0       249         0       199         0         0         0   3145376\n  612000000:         1         0       199         0         0         0         0   1712552\n  714000000:         0         0         0         0         0         0         0         0\n  816000000:         0         0         0         0         0         0         0         0\n  918000000:         0         0         0         0         0         0         0         0\nTotal transition : 901\n"
    },
    {
        "filename": "uevent",
        "content": ""
    }
]

The transition states table file, trans_stat above, is interesting:

     From  :   To
           : 306000000 408000000 510000000 612000000 714000000 816000000 918000000   time(ms)
  306000000:         0         2         0         1         0         0         0     19268
* 408000000:         1         0       249         0         0         0         0   2329852
  510000000:         0       249         0       199         0         0         0   3145376
  612000000:         1         0       199         0         0         0         0   1712552
  714000000:         0         0         0         0         0         0         0         0
  816000000:         0         0         0         0         0         0         0         0
  918000000:         0         0         0         0         0         0         0         0
Total transition : 901

If this machine has had the GPU at 306_000_000 Hz for 19_268 ms, does that mean that this is the time this GPU has been throttled?
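
For reference, the time(ms) column can be pulled out of trans_stat programmatically. The sketch below uses a hypothetical helper name and assumes the layout shown above (one row per frequency, an optional leading * on the current row, time in the last column). It only reports how long the GPU sat at each frequency; that time could equally come from the governor scaling down under light load, so by itself it does not prove throttling.

```python
def parse_trans_stat(text):
    """Return {frequency_hz: time_ms} parsed from devfreq trans_stat text."""
    residency = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        label = head.replace("*", "").strip()  # '*' marks the current frequency
        if not sep or not label.isdigit():
            continue  # header rows and the "Total transition" line
        fields = rest.split()
        if fields:
            residency[int(label)] = int(fields[-1])  # last column is time(ms)
    return residency
```

Applied to the table above, parse_trans_stat(text)[306000000] is 19268, and the three highest frequencies map to 0.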

From throtle-points-for-r35.4.1 I see this table:

Is 306 MHz the heavy throttling level, 87.5% of GPU max clock speed?
Similarly, would 612 MHz be no throttling, maybe 510 MHz be Light, and 408 MHz be Medium?
How to read these transitions?

Here I am watching the min, actual, and max frequencies in the GPU of that Orin NX unit. What transition would be a throttle event?

Thanks,

Pablo

KevinFFF · August 7, 2025, 9:04am

It can be configured by bpmp, and the spec documentation is in Jetson Orin Nano Series, Jetson Orin NX Series and Jetson AGX Orin Series — NVIDIA Jetson Linux Developer Guide, although we don’t recommend that users modify it.

The Linux kernel thermal framework provides a node called emul_temp, where you can override the temperature with a value higher than the throttling temperature. Alternatively, you can enable CONFIG_THERMAL_WRITABLE_TRIPS and recompile the kernel so that you can overwrite trip_point_*_temp.
You can refer to config_thermal_writable_trips - kernelconfig.io for details.
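
A test harness could wrap the emul_temp approach so the emulation is always switched off again afterwards. This is a minimal sketch with a hypothetical helper name. It assumes the zone directory exposes emul_temp (which needs CONFIG_THERMAL_EMULATION in the kernel), that the process runs as root, and that values are in millidegrees Celsius, with 0 disabling emulation as described in the kernel thermal sysfs documentation.

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def emulated_temperature(zone_dir, millidegrees):
    """Make a thermal zone report `millidegrees`, then restore real readings."""
    emul = Path(zone_dir) / "emul_temp"
    emul.write_text(f"{int(millidegrees)}\n")
    try:
        yield emul
    finally:
        emul.write_text("0\n")  # writing 0 turns emulation off again
```

For example, running the test body inside "with emulated_temperature("/sys/class/thermal/thermal_zone1", 100000):" would hold the zone above the 99000 passive trip from the device tree excerpt for the duration of the block. The zone path here is the GPU zone reported earlier in this thread and may differ on other units.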

padames · August 7, 2025, 5:07pm

@KevinFFF emul_temp was the ticket to be able to simulate the throttling events using the temperature triggers to go from state 0→1 (SW throttling) and state 1→2 (HW throttling) as demoed in the video attached. I used the following to monitor the simulation on another terminal window:

watch 'echo "GPU Temp, deg mC= $(cat /sys/class/thermal/thermal_zone1/temp)\nGPU state= $(cat /sys/class/thermal/cooling_device1/cur_state)\nCPU state= $(cat /sys/class/thermal/cooling_device0/cur_state)\n" '

I think this is a full solution. I can adapt the existing C++ business logic to monitor these events and record them in the database, and display them on the monitoring UI.
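
The start/end logic can be prototyped before porting it to C++. The sketch below is an illustration only: the class and function names are hypothetical, it treats any non-zero cur_state as "throttled", and it assumes cooling_device1 is the GPU cooling device, as in the watch command above.

```python
import time
from pathlib import Path


def read_state(path):
    """Read an integer cooling state from a sysfs cur_state file."""
    return int(Path(path).read_text().strip())


class ThrottleTracker:
    """Turn cur_state samples into start/end events (non-zero = throttled)."""

    def __init__(self):
        self.active_since = None

    def update(self, state, now):
        if state > 0 and self.active_since is None:
            self.active_since = now
            return ("start", now)
        if state == 0 and self.active_since is not None:
            self.active_since = None
            return ("end", now)
        return None  # no change in throttling status


def watch(path, interval_s=1.0, max_samples=None, sleep=time.sleep, clock=time.time):
    """Poll cur_state and yield ("start"|"end", timestamp) on each transition."""
    tracker = ThrottleTracker()
    taken = 0
    while max_samples is None or taken < max_samples:
        event = tracker.update(read_state(path), clock())
        if event is not None:
            yield event
        taken += 1
        sleep(interval_s)
```

Iterating over watch("/sys/class/thermal/cooling_device1/cur_state") yields one event per transition, so a 0→1→2→0 sequence produces a single start and a single end, which matches not differentiating between SW and HW throttling.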

I ported this Orin NX approach into our application, and here is what the simulated SW and HW throttling events look like. We don’t differentiate between them:

Thanks
