Can't detect dGPU utilization through tegrastats

Hi,

Right now I’m working on the PX2 platform, and I would like to confirm the GPU runtime utilization. From this forum, I know that I can use “sudo tegrastats” to check the GPU utilization: “GR3D_FREQ 0%@1275” refers to the iGPU, and “GR3D_PCI 0%@2” refers to the dGPU. But when I measured the GPU utilization with tegrastats, I could only see a percentage for the iGPU; the percentage for the dGPU was always 0. Was there something wrong with my test scenario? The commands I used for testing are listed below.

  1. download NVIDIA_CUDA-9.0_Samples
  2. test iGPU
  • run "sudo tegrastats" in one terminal
  • run ./matrixMul -device=1 in the /NVIDIA_CUDA-9.0_Samples/0_Simple/matrixMul folder in another terminal
  • get the result "RAM 1456/6668MB (lfb 1079x4MB) CPU [0%@1981,61%@2031,42%@2033,0%@1980,0%@1979,1%@1981] EMC_FREQ 13%@1600 GR3D_FREQ 99%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@45C MCPU@45C Tegra@0C Tdiode@50.25C AO@45C GPU@51C BCPU@45C thermal@50.25C Tegra@50.25C Tj@50.25C"
  3. test dGPU
  • run "sudo tegrastats" in one terminal
  • run ./matrixMul -device=0 in the /NVIDIA_CUDA-9.0_Samples/0_Simple/matrixMul folder in another terminal
  • get the result "RAM 1460/6668MB (lfb 1080x4MB) CPU [0%@1982,4%@2031,98%@2030,0%@1981,0%@1980,1%@1980] EMC_FREQ 2%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@43C MCPU@43C Tegra@0C Tdiode@46C AO@41C GPU@49C BCPU@43C thermal@46C Tegra@46C Tj@46C"
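For reference, the two utilization fields can be pulled out of a tegrastats line with a small script. This is just a minimal sketch that assumes the “GR3D_FREQ <load>%@<freq>” / “GR3D_PCI <load>%@<freq>” format shown in the output above:

```python
import re

def parse_gpu_util(line):
    """Extract (iGPU %, dGPU %) from one tegrastats output line.

    GR3D_FREQ reports the integrated (Tegra) GPU and GR3D_PCI the
    discrete GPU, each in the form 'NAME <load>%@<freq>'.
    """
    igpu = re.search(r"GR3D_FREQ (\d+)%@\d+", line)
    dgpu = re.search(r"GR3D_PCI (\d+)%@\d+", line)
    return (int(igpu.group(1)) if igpu else None,
            int(dgpu.group(1)) if dgpu else None)

# Shortened sample taken from the iGPU test output above.
sample = ("RAM 1456/6668MB (lfb 1079x4MB) EMC_FREQ 13%@1600 "
          "GR3D_FREQ 99%@1275 APE 245 GR3D_PCI 0%@2 GPU@51C")
print(parse_gpu_util(sample))  # -> (99, 0)
```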

Please help. Thanks!

Dear chenghul,
We are looking into this issue and will get back to you.

I have attached the new version of tegrastats here.
Remove the .txt extension to run it.
tegrastats.txt (66 KB)

Hi ShaneCCC,

Thanks for your help, but the result is the same when I run the same test commands. Is there anything I missed? Could you share how you tested it?

Thanks,
Krammer

Did you launch tegrastats in superuser mode?
Please try sudo ./tegrastats

Yes, I launched it with sudo.

Hi ShaneCCC,

When I used your version and the previous version to compare the two, the dGPU (id=0) was processing a Caffe workload.
The results are as follows.

Your tegrastats
RAM 2210/6668MB (lfb 773x4MB) CPU [0%@1997,23%@2035,77%@2034,0%@1996,0%@1996,0%@1995] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2573 PLL@43.5C MCPU@43.5C Tegra@0C Tdiode@48.5C AO@43.5C GPU@49.5C BCPU@43.5C thermal@48.5C Tegra@48.5C Tj@48.5C
RAM 2210/6668MB (lfb 773x4MB) CPU [0%@1970,80%@2035,20%@2035,0%@1964,0%@1965,1%@1964] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2581 PLL@44C MCPU@44C Tegra@0C Tdiode@48.25C AO@43.5C GPU@49.5C BCPU@43.5C thermal@48.5C Tegra@48.25C Tj@48.25C
RAM 2210/6668MB (lfb 773x4MB) CPU [0%@1950,65%@2015,34%@2018,0%@1947,0%@1948,0%@1949] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2573 PLL@43.5C MCPU@43.5C Tegra@0C Tdiode@48.5C AO@43C GPU@49.5C BCPU@43.5C thermal@48.5C Tegra@48.5C Tj@48.5C

Previous tegrastats
RAM 2213/6668MB (lfb 773x4MB) CPU [0%@1997,0%@2035,0%@2034,0%@1996,0%@1996,0%@1995] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@42.5C MCPU@42.5C Tegra@0C Tdiode@47.25C AO@42.5C GPU@48.5C BCPU@42.5C thermal@47.75C Tegra@47.25C Tj@47.25C
RAM 2214/6668MB (lfb 773x4MB) CPU [0%@1998,49%@2034,51%@2035,0%@1996,0%@1997,0%@1996] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 1% GR3D_PCI 0%@2 PLL@42.5C MCPU@42.5C Tegra@0C Tdiode@47.25C AO@42.5C GPU@48.5C BCPU@42.5C thermal@47.25C Tegra@47.25C Tj@47.25C
RAM 2213/6668MB (lfb 773x4MB) CPU [0%@1966,80%@2034,20%@2035,0%@1966,0%@1964,0%@1968] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@42.5C MCPU@42.5C Tegra@0C Tdiode@47C AO@42.5C GPU@48.5C BCPU@42.5C thermal@47.25C Tegra@47C Tj@47C

The only difference seems to be that yours shows the dGPU memory clock.
Am I right?

Actually, I need to check GPU memory usage, like nvidia-smi does.

Hi,

You can check CUDA memory with cudaMemGetInfo().
Here is an example for your reference:
https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/post/5168834/#5168834
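If you want to try it quickly without building a full CUDA sample, here is a rough Python sketch that calls cudaMemGetInfo() through ctypes. It assumes libcudart.so is loadable on your system; format_mem is a hypothetical helper name used only for printing:

```python
import ctypes

def format_mem(free_bytes, total_bytes):
    """Format memory numbers in MB, similar to nvidia-smi-style tools."""
    used = total_bytes - free_bytes
    mb = 1024.0 * 1024.0
    return ("GPU memory usage: used = %.2f MB, free = %.2f MB, total = %.2f MB"
            % (used / mb, free_bytes / mb, total_bytes / mb))

def cuda_mem_info():
    """Query free/total device memory via cudaMemGetInfo()."""
    # Assumes the CUDA runtime library is installed and on the loader path.
    cudart = ctypes.CDLL("libcudart.so")
    free = ctypes.c_size_t()
    total = ctypes.c_size_t()
    status = cudart.cudaMemGetInfo(ctypes.byref(free), ctypes.byref(total))
    if status != 0:
        raise RuntimeError("cudaMemGetInfo failed with status %d" % status)
    return free.value, total.value

if __name__ == "__main__":
    try:
        print(format_mem(*cuda_mem_info()))
    except OSError:
        print("CUDA runtime not available on this machine")
```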

Thanks.

Thanks AastaLLL so much.

I’ve done a memory test based on the code that you gave me.
The results from your code and from nvidia-smi are the same on Windows (VS2015, x64) and Ubuntu (GPU server, 14.04, x64).

That code also seems to work well for Drive PX2’s GPU 0, because when I tested the same deep-learning program on the GPU server and on the Drive PX2 Pascal GPU (GPU 0), the GPU memory usage was almost identical (±50 MB).

However, for the Parker GPU (GPU 1 on Drive PX2), the results of the code disagree with “sudo tegrastats” on both Drive PX2 (DriveInstall 5.0.5.0bL SDK b3) and Jetson TX2 (JetPack 3.1).

[Your code on Drive PX2 (GPU 1, Parker)]
GPU memory usage: used = 2889.60 MB, free = 3777.98 MB, total = 6667.57 MB

[sudo tegrastats on Drive PX2 (GPU 1, Parker)]
RAM 1783/6668MB (lfb 730x4MB) CPU [1%@1965,0%@2034,0%@2036,0%@1965,0%@1964,0%@1964] EMC_FREQ 0%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@46C MCPU@46C Tegra@0C Tdiode@50.75C AO@46C GPU@52C BCPU@46C thermal@51.5C Tegra@50.75C Tj@50.75C

Could you tell me why?

Is there a good way to actually see the percentage of iGPU and dGPU compute usage (like nvidia-settings or nvidia-smi on x86_64)?

Hi nunovxax9,

I needed that function too, so I tried several things.
My conclusion is that it is not possible right now.

Using nvmlDeviceGetUtilizationRates() from the NVML API, you can get the GPU utilization rate at the code level, but according to the link below, it is not available on the Tegra series.
https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference
(The output of nvmlDeviceGetUtilizationRates() is the same as “Volatile GPU-Util” of “nvidia-smi”.)

There is no way to check the utilization rate of the dGPU until NVIDIA developers add support for it.
However, the iGPU utilization can be checked via the GR3D_FREQ value when you run “sudo tegrastats”.

Hi,

For the dGPU, you can get the current clock information via

sudo cat /sys/kernel/debug/gpu_pci/clocks/gpc2clk

and the GPU utilization percentage via

cat /sys/bus/pci/drivers/nvgpu/[dynamic ID]/load

Currently, there is something incorrect in the ‘load’ node and it always reports 0.
We are checking this with the core team internally and will update you later.

Thanks.

Thank you AastaLLL,
I hope that this issue will be resolved soon.
Have a good day.

Hi,
Have you solved that problem, and is there a solution in Python to monitor the GPU load dynamically?
Thank you!

Hi,

Thanks for your patience.
We are still working on this issue.

We will update you once we have further information.
Thanks.


Any update, please? I tried running a matmul of large matrices through TensorFlow and I get 0% load on both.

Tegrastats shows 99% when running the same code on the iGPU, so there’s definitely something wrong with how the dGPU load is reported.

A fix for the tegrastats issue will be included in the upcoming Drive OS release. Thanks!


That’s great news, thanks! Any ETA on when this is going to be released?

Can one update just tegrastats (and any necessary dependencies) on an existing Drive PX 2 installation? I’m not keen on reflashing the device again; we have all the tools and environment set up there, and it takes quite some time to rebuild it :(

Are you even going to release a new Drive OS for Drive PX 2?