I am porting a small C++ application service that warns the user in a web GUI about throttling events occurring in the Edge device.
In TX2 NX the throttling events could be detected by reading the file: /sys/class/thermal/thermal_zone2/cdev6/userspace_alert/thermal_alert
The application had no difficulty opening the file because its permissions were readable by everyone.
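A minimal sketch of how such a file could be polled from C++ is shown below. The helper names `read_sysfs_long` and `is_throttling_active` are hypothetical and not taken from the original application; the only assumption about the file is that it contains a single integer, with 1 meaning active and 0 meaning inactive.

```cpp
#include <fstream>
#include <string>

// Reads a single integer from a sysfs-style text file into `out`.
// Returns false if the file cannot be opened or does not start with a number.
bool read_sysfs_long(const std::string& path, long& out) {
    std::ifstream file(path);
    if (!file.is_open()) {
        return false;
    }
    long value = 0;
    if (!(file >> value)) {
        return false;
    }
    out = value;
    return true;
}

// True when the alert file reports an active throttling event
// (assumed convention: 1 = active, 0 = inactive).
bool is_throttling_active(const std::string& alert_path) {
    long value = 0;
    return read_sysfs_long(alert_path, value) && value != 0;
}
```

An unreadable or missing file is reported as "not throttling" here; a real service would probably want to surface that case separately.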
Also, I could simulate a throttling event by artificially lowering the trip point below the current measured GPU temperature. The test sequence is below.
The monitored temperature (valid for TX2 units) is:
$ cat "/sys/class/thermal/thermal_zone2/temp"
55000
Verify the current trip point for GPU overheating:
$ cat /sys/class/thermal/thermal_zone1/trip_point_1_temp
99000
$ echo "50000" | sudo tee /sys/class/thermal/thermal_zone1/trip_point_1_temp
[sudo] password for intelliview:
50000
$ cat /sys/class/thermal/thermal_zone1/trip_point_1_temp
50000
No throttling happens from the point of view of the script, which I have running on another terminal.
I can also do: cat /sys/kernel/debug/bpmp/debug/soctherm/gpu_throt_status
And I get a 0.
*** Please note that this reply is generated by LLM automatically ***
To address your question, I’ll provide a comprehensive answer that compares the thermal management mechanisms between the TX2 NX and Orin NX platforms.
Thermal Management on TX2 NX
On the TX2 NX platform, you were able to detect throttling events by reading a file that reported the thermal status. You could also simulate a throttling event by artificially lowering the trip point below the current measured GPU temperature. The monitored temperature was valid for TX2 units, and you could modify the trip point to trigger an alert.
Thermal Management on Orin NX
On the Orin NX platform, you’re having difficulty finding the file that reports the thermal status. When you run the script and lower the GPU trip temperature, no throttling happens from the point of view of the script. You also tried running cat /sys/class/thermal/thermal_zone0/trip_points and got a 0.
The issue here is that the Orin NX platform uses a different thermal management mechanism than the TX2 NX. The Orin NX platform uses the BPMP Thermal Framework to manage thermal zones, power states, and sensing. This framework provides a more comprehensive and flexible way to manage thermal events.
Changing the Trip Point for GPU Throttling on Orin NX
To change the trip point for GPU throttling on the Orin NX platform, you’ll need to use the BPMP Thermal Framework APIs. Specifically, you’ll need to use the thermal_zone API to modify the trip point for the GPU thermal zone.
Here’s an example of how you can modify the trip point using the thermal_zone API:
This command sets the trip point for the GPU thermal zone to 80°C. You can adjust this value to trigger an alert at a different temperature.
Reusing Business Logic on Orin NX
While you can reuse the same business logic for detecting throttling events on the Orin NX platform, you’ll need to modify the logic to use the BPMP Thermal Framework APIs. You can use the thermal_zone API to read the thermal status and modify the trip point to trigger an alert.
Comparison of Thermal Management Mechanisms
Here’s a comparison of the thermal management mechanisms between the TX2 NX and Orin NX platforms:
| | TX2 NX | Orin NX |
|---|---|---|
| Thermal Management Mechanism | File-based thermal status reporting | BPMP Thermal Framework |
| Trip Point Modification | File-based modification | thermal_zone API |
| Thermal Zone Management | Not applicable | BPMP Thermal Framework |
In summary, the Orin NX platform uses a different thermal management mechanism than the TX2 NX platform. To change the trip point for GPU throttling on the Orin NX platform, you’ll need to use the BPMP Thermal Framework APIs. You can reuse the same business logic for detecting throttling events, but you’ll need to modify the logic to use the BPMP Thermal Framework APIs.
Thanks. According to this answer, the temperature trip point for a control action on the GPU clock frequency would be in thermal_zone_0, which I know is not accurate:
Zone 1 is the correct one, as I explained in the question. Moving on, the example also has the wrong temperature values; they must be in millidegrees Celsius, so multiply the 80 by 1000 (giving 80000) and it could be a valid example.
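As a small worked example of the unit convention, the sketch below converts whole degrees Celsius to the millidegree values that the trip point files expect. The function name `celsius_to_millidegrees` is hypothetical.

```cpp
// Converts whole degrees Celsius to millidegrees Celsius,
// the unit used by the thermal sysfs files (for example, 99000 for 99 degrees).
constexpr long celsius_to_millidegrees(long celsius) {
    return celsius * 1000;
}

// 80 degrees Celsius must be written as 80000, not 80.
static_assert(celsius_to_millidegrees(80) == 80000,
              "80 C should be 80000 millidegrees");
```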
Could anyone answer what path/file to monitor for the occurrence of the throttling event?
Is my guess correct? The reason I ask is that I have read the documentation, and it is not clear to me where this can be done:
By the way, we are on Tegra R35.4.1. Those pages are the closest I could get to my version.
The automated answer did not address this at all, or I missed it altogether. I used Warp AI agents at the beginning of this port; they were useful, but I need someone who can explain how to do this for real. Is that script even supposed to work on Orin NX? I ran some of the Python code at the command line to see the paths and those paths don’t exist in my Orin NX.
The browser UI is the front-end of a 24/7 industrial leak detection system based on dual cameras: thermal and visible light.
A user can view system events on a dedicated page. We publish the start and end times of any GPU throttling event there.
You asked whether I am looking to track overcurrent (OC) events. I am unsure whether slowing down the GPU clock qualifies as an OC event, just like turning a fan on certainly would. Can you confirm this?
I kindly remind you that in the TX2 there was a file where the GPU throttling state (active=1, inactive=0) could be read at any point in time. The file is /sys/class/thermal/thermal_zone2/cdev6/userspace_alert/thermal_alert. However, there is no equivalent in Orin NX.
The utility of knowing about GPU throttling becomes more apparent when other subsystems use it to adjust their responses for leak detection and system reliability assessments during those periods.
We are on JP 5.1.2 and L4T 35.4.1. The carrier board is a Connect Tech Inc. Photon NGX003 carrier board for Orin TX2 and Orin NX 16 GB.
Thanks for your suggestion to watch the monitor counts and throttling events while simulating a GPU over-temperature condition by lowering the trip point for the passive thermal control strategy. I did not mention it in the initial posting. I followed the same testing steps described there, that is, lowering the threshold temperature for the passive cooling strategy with echo "50000" | sudo tee /sys/class/thermal/thermal_zone1/trip_point_1_temp when the GPU temperature is around 52000 mC. I have always seen zero counts for all monitors; that is, these values don’t move:
I see that NVIDIA has adopted a closer implementation of the ACPI thermal device interface in Orin NX, which sounds great: ACPI_spec_6_4_Thermal_control. However, I am not familiar with the kernel command line interface: linux_kernel_5_10_thermal_sysfs-api.
Do I have to use it to set and manipulate the trip points for the passive cooling strategy for the GPU?
Is that what this section is outlining: linux-thermal-framework?
Can you provide the location of this TEGRA234_THERMAL_ZONE_GPU thermal zone definition in my Tegra installation? If I have to create this file, where do I save it? If NVIDIA already provides it, what is the file called?
I think if I can change the trip point there under gpu-hot-surface from 0x11170 (=70000mC) to a temperature lower than the current GPU thermal average, I will be better able to see changes in the hardware monitors via: # cat /sys/class/hwmon/hwmon1/oc*_throt_en
I assume that I will be able to work out the start and end of the throttling event if I sample the values of the overcurrent files oc*_throt_en and oc*_event_cnt quickly enough. Assuming that this oc = GPU slowing down.
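One way to use a counter when sampling may be too slow to catch a short event is to compare successive readings, as in the sketch below. It assumes, without confirmation from NVIDIA in this thread, that an oc*_event_cnt file holds a monotonically increasing count; the function name `events_since` is hypothetical. Note that a counter only shows that events occurred between two samples, not their exact start and end times.

```cpp
// Number of events that occurred between two samples of a counter that is
// assumed to increase monotonically.
// If the counter went backwards (for example after a reboot or driver reload),
// the current value is taken as the number of new events.
unsigned long events_since(unsigned long previous, unsigned long current) {
    return current >= previous ? current - previous : current;
}
```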
@KevinFFF, are these values modifiable? If so, how do I change them? Could you please show me how to access them?
I need to test the logic of detecting the start and stop of the GPU throttling event.
Regarding tegrastats, the program’s output appears to be already processed input from system file readings. Are you suggesting parsing the output of that program? What variables would indicate that a hardware throttling event is happening? The only two variables that are not RAM or temperature readings are: EMC_FREQ and GR3D_FREQ. Do you suggest that I track them at the same sampling rate tegrastats samples them? If that is the case, why not sample the original files where those frequencies get recorded? What would a non-zero frequency mean? What threshold frequency value in the valid range would indicate that the software throttle has started/stopped?
I have explained how I tracked the GPU throttle events (lowering the frequency of the GPU clock) programmatically and tested by lowering the temperature threshold (trigger), in a Jetson TX2 NX. I assumed at the time that these events included software and hardware throttling.
My question is: how can a GPU throttle event (software or hardware) be detected programmatically, both start and stop, on a custom board integrating Orin NX, using standard system files?
As I understand it, both software and hardware throttling reduce the GPU clock frequency. The difference is the source of the trigger. In software throttling, the trigger comes from temperature readings, and it is a corrective measure to prevent overheating due to sustained heat buildup (heavy processing, without power surges, for too long, perhaps on a hot day out in the field). In hardware throttling, the trigger is an overcurrent event from batteries, regulators or inductors (in switches), which may also lead to overheating.
I have reasoned that, regardless of the source, if I could monitor the actual GPU frequency, I could infer when throttling is happening, instead of monitoring temperatures and OC events.
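A rough sketch of that idea is below: it flags a sample when the measured frequency falls under a chosen percentage of the maximum. The function name `below_percent_of_max` and the threshold are hypothetical, and this is only a heuristic, since a low frequency can also mean the GPU is lightly loaded rather than throttled.

```cpp
// Heuristic check: is the current GPU frequency below a given percentage
// of the maximum frequency? Frequencies are in Hz; percent is 0..100.
// Integer arithmetic avoids floating-point rounding at the threshold.
bool below_percent_of_max(long long current_hz, long long max_hz, int percent) {
    return current_hz * 100 < max_hz * percent;
}
```

For example, with a maximum of 612 MHz and a 90% threshold, a reading of 306 MHz would be flagged, while a reading of 612 MHz would not.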
I have been looking everywhere in the installation. Is this a good place to be monitoring? Are there more elegant ways of looking at this?
Attached is a text file in JSON format with the names of all the files and a sample of their values for my unit from the path:
Is 306 MHz the heavy throttling level, 87.5% of GPU max clock speed?
Similarly, would 612 MHz be no throttling, maybe 510 MHz be Light, and 408 MHz be Medium?
How to read these transitions?
Here I am watching the min, actual, and max frequencies in the GPU of that Orin NX unit. What transition would be a throttle event?
The Linux kernel thermal framework provides a node called emul_temp where you can overwrite the temperature with a value higher than the throttling temperature. Alternatively, you can enable CONFIG_THERMAL_WRITABLE_TRIPS and recompile the kernel to be able to overwrite trip_point_temp.
You can refer to config_thermal_writable_trips - kernelconfig.io for details.
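For testing from C++ rather than the shell, a write helper along these lines could be used. It is a sketch only: `write_sysfs_long` is a hypothetical name, the exact emul_temp path (for example under thermal_zone1) is an assumption for this board, the node exists only when the kernel has thermal emulation enabled, and writing to it requires root.

```cpp
#include <fstream>
#include <string>

// Writes a single integer followed by a newline to a sysfs-style file.
// Returns false if the file cannot be opened or the write fails.
bool write_sysfs_long(const std::string& path, long value) {
    std::ofstream file(path);
    if (!file.is_open()) {
        return false;
    }
    file << value << '\n';
    file.flush();
    return file.good();
}
```

The value is in millidegrees Celsius, so emulating 100 degrees means writing 100000. According to the kernel thermal sysfs documentation, writing 0 to emul_temp should stop the emulation.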
@KevinFFF, emul_temp was the ticket to be able to simulate the throttling events using the temperature triggers to go from state 0→1 (SW throttling) and state 1→2 (HW throttling), as demoed in the attached video. I used the following to monitor the simulation in another terminal window:
I think this is a full solution. I can adapt the existing C++ business logic to monitor these events and record them in the database, and display them on the monitoring UI.
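The start/end bookkeeping could look roughly like the sketch below. It is not the actual application code: `ThrottleTracker` and `ThrottleEvent` are hypothetical names, timestamps are plain milliseconds supplied by the caller, and any non-zero state (1 for SW, 2 for HW) is treated as one throttling event without differentiating between them.

```cpp
#include <cstdint>
#include <vector>

// One completed throttling interval, in caller-supplied milliseconds.
struct ThrottleEvent {
    std::int64_t start_ms;
    std::int64_t end_ms;
};

// Turns a stream of sampled states into start/end intervals.
class ThrottleTracker {
public:
    // Feeds one sample: 0 = no throttling, non-zero = SW or HW throttling.
    void update(int state, std::int64_t now_ms) {
        const bool active = state != 0;
        if (active && !active_) {
            start_ms_ = now_ms;                        // event starts
        } else if (!active && active_) {
            finished_.push_back({start_ms_, now_ms});  // event ends
        }
        active_ = active;
    }

    bool active() const { return active_; }
    const std::vector<ThrottleEvent>& finished() const { return finished_; }

private:
    bool active_ = false;
    std::int64_t start_ms_ = 0;
    std::vector<ThrottleEvent> finished_;
};
```

A transition from state 1 to state 2 does not open a new interval here, so a SW throttle that escalates to a HW throttle is recorded as a single event.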
I ported this Orin NX implementation to our application, and here is what the simulated SW and HW throttling events look like. We don’t differentiate between them: