Activating GPU Power Rails on AGX Orin without a GUI

Hello,

I’m trying to put the GPU under load for testing (I’m using gpu-burn for this), but in tegrastats I see GR3D_FREQ reach 99% while the GPU and CV temperatures remain at -256C.

I looked this issue up on these forums and found answers like this one, which explain that the -256C reading appears because the GPU is power gated: when it is not in use, its power is shut off, so the temperature sensors return no reading. I see recommendations to use the “Jetson Power GUI” to turn on the GPU power.

One issue: I am accessing this Orin over SSH, and it has no monitor or keyboard directly attached. Is there a tool or way to modify the power management so that the GPU is turned on and I can see the temperature readings?

An additional side question: I don’t understand how the GPU can be “off” but GR3D_FREQ is at 99%. Does the power only disconnect the temperature sensors, not the GPU itself?

Thank you for any clarification or assistance on this issue!

Hi mason15,

Are you using the devkit or custom board for AGX Orin?
What’s your Jetpack version in use?

Please also share the result of “sudo tegrastats” for further check.

Hi Kevin!

Are you using the devkit or custom board for AGX Orin?

This is a custom board using the Orin Industrial SOM.

What’s your Jetpack version in use?

I ran:

dpkg-query --show nvidia-l4t-core

And got:

nvidia-l4t-core	35.5.0-20240219203809

Please also share the result of “sudo tegrastats” for further check.

05-10-2024 11:54:03 RAM 3274/54718MB (lfb 10932x4MB) SWAP 0/27359MB (cached 0MB) CPU [2%@729,0%@729,0%@729,6%@729,0%@729,0%@729,0%@729,0%@729,0%@1497,0%@1497,0%@1497,13%@1497] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[0,0] VIC_FREQ 921 APE 174 CV0@-256C CPU@52.656C Tboard@42C SOC2@49.125C Tdiode@42.5C SOC0@50.687C CV1@-256C GPU@-256C tj@52.562C SOC1@50.312C CV2@-256C VDD_GPU_SOC 2154mW/2154mW VDD_CPU_CV 718mW/718mW VIN_SYS_5V0 7862mW/7862mW NC 0mW/0mW VDDQ_VDD2_1V8AO 796mW/796mW NC 0mW/0mW

I’ve tried to verify this locally.
It seems matrixMul finishes too quickly for tegrastats to sample it.
Please try running it in a loop (while 1).
With that, I could see both the GR3D_FREQ value and the GPU temperature show up.

Apologies but I am a little bit lost. When you say to use “while 1” to run “it” in a loop, what is it that I should be running in a loop? gpu-burn? Or some other tool?

I ran matrixMul from cuda-samples to check whether values appear for GR3D_FREQ and the GPU temperature.
If I run matrixMul only once, I can’t get the values, probably because it finishes too quickly for tegrastats to sample it.
So I wrote a script to run it in an infinite loop, and with that I got the expected results.

Please also try to verify with cuda-samples.
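
For reference, a loop like the one described above could look roughly like the sketch below. It is not the exact script used here: the run_for_seconds helper and the MATRIXMUL path are assumptions made for illustration, and the loop is time-bounded instead of infinite so that it stops on its own.

```shell
#!/bin/sh
# Sketch: keep the GPU busy long enough for tegrastats to sample it.
# MATRIXMUL is an assumed path; point it at wherever the sample was built.
MATRIXMUL="${MATRIXMUL:-./matrixMul}"

# Run a command back to back until the given number of seconds has elapsed.
# Stops early and returns 1 if the command fails.
run_for_seconds() {
    duration="$1"; shift
    end=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$end" ]; do
        "$@" > /dev/null || return 1
    done
}

# Example (not executed here): run_for_seconds 60 "$MATRIXMUL"
```

Run it in one SSH session and watch sudo tegrastats in another while the loop is active.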

Ah ok. I was using gpu-burn not matrixMul, but regardless I tried running a loop of matrixMul and the GPU temperature did show up, so I’m wondering if the issue is with gpu-burn. Are you able to test and see if the GPU temperature doesn’t show up? In both cases, the GR3D_FREQ value increases.

I still don’t understand how GR3D_FREQ can show up as 99% (implying the GPU is under load and running) but the GPU temperature is at -256. This is very confusing.

Yes, that does not look like the expected result to us.

Please share the steps you use to run gpu-burn.

Here is the setup I have been using:

Ensure libcublas and g++ are available:

sudo apt install cuda-toolkit-11-4 g++ -y

Clone the repository and build gpu-burn:

git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make

And then run the command:

./gpu_burn -m 40% 60

However, now I am seeing GPU temperatures appear in tegrastats. I am wondering whether I had been running tegrastats without sudo and that was the cause. I thought I had used sudo before, but perhaps not. That may be the whole solution to this issue 😓️.

Yes, you should run tegrastats with sudo.
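
To check quickly over SSH whether any sensors still read -256C, something like the sketch below may help. The gated_sensors helper is a hypothetical name introduced here, and the -256C pattern is simply taken from the tegrastats output posted earlier in this thread.

```shell
#!/bin/sh
# Sketch: list the sensors that tegrastats reports as -256C.
# Reads tegrastats output on stdin and prints each matching sensor name once.
gated_sensors() {
    grep -o '[A-Za-z0-9_]*@-256C' | cut -d@ -f1 | sort -u
}

# Example (not executed here; needs a Jetson):
#   sudo tegrastats --interval 1000 | head -n 5 | gated_sensors
```

An empty result means no sensor in the sampled lines reported -256C. Comparing the output with and without sudo while the GPU is under load should show whether sudo makes the difference on a given board.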
