Constant GPU inference power


I am trying to build a power profile for each Jetson Nano component independently. For most components the instantaneous power varies with the load (e.g. CPU power consumption depends on its % usage).

When modeling the GPU, I found that the actual power consumption (reported by the TegraStats utility) during inference is constant at the maximum. For inference I use the TensorRT framework. I tried both a full model 120 layers deep and a single-layer model, and the reported power is the same. The energy changes depending on the inference performance, but one would not expect such different loads to need the same number of cores; I may be wrong, though. Is this behavior expected?

Thank you in advance

Best regards

Probably the loading is heavy in both cases. Please try this sample and check if you see a difference:


Jetson Linux API Reference: 02_video_dec_cuda
The sample contains simple CUDA code that generates little GPU loading. Please run it as a comparison.


Thanks for the response.

You were right; this encoding pipeline only partially occupies the GPU. Can I then assume that inference with any model size through the TRT framework will occupy 100% of the GPU? By the way, I found this behavior on three boards: Nano, AGX and NX.

For a single-layer model, needing the whole GPU might make sense on the Nano board, but on the AGX it seems like overkill.


You may check whether it is specific to the model. We have a ResNet10 model in the DeepStream SDK. You can try this command on Jetson platforms for comparison:

/opt/nvidia/deepstream/deepstream-5.1/samples/configs/deepstream-app$ deepstream-app -c source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt


You are right. I can now see non-static GPU usage.

Do you have any idea why it may be using the full GPU with a single-ReLU-layer model? The input is 224x224x3. The same happens with a MobileNetV1 model with the same input.

Also, I am reusing the Python TensorRT framework examples provided as a base.



Do you use the pure TensorRT Python API, or the version integrated in TensorFlow or PyTorch?

With the pure TensorRT API, all inference jobs are submitted through the enqueue function.
If the queue is never empty, the GPU keeps inferencing and the loading stays at its maximum.
But if you submit input data periodically, e.g. every 33 ms, the GPU load will decrease during the periods when the queue is empty.
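The effect of pacing can be seen with a minimal Python sketch. This does not use TensorRT itself; `run_inference` is a hypothetical stand-in for an enqueue/execute call, and the 5 ms inference time and 33 ms period are assumed values for illustration:

```python
import time

def run_inference():
    # Hypothetical stand-in for a TensorRT enqueue/execute call;
    # here we simply simulate a 5 ms inference.
    time.sleep(0.005)

PERIOD_S = 0.033  # submit one input every 33 ms (e.g. 30 fps video)

busy = 0.0
start = time.monotonic()
for _ in range(10):
    t0 = time.monotonic()
    run_inference()
    busy += time.monotonic() - t0
    # Wait until the next 33 ms slot; during this wait the queue is
    # empty and the GPU would be idle, so its load drops.
    elapsed = time.monotonic() - start
    next_slot = (elapsed // PERIOD_S + 1) * PERIOD_S
    time.sleep(next_slot - elapsed)

total = time.monotonic() - start
print(f"duty cycle: {busy / total:.0%}")  # roughly 5/33, i.e. ~15%
```

If the pacing sleep is removed, the loop submits back-to-back and the duty cycle goes to ~100%, which matches the constant-maximum load reported by TegraStats.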



I use TensorRT python API.

I will try it with the queue system.

On the other hand, I have a question regarding GPU core usage: is there a way of knowing/checking which cores are in use at any moment in time? I am assuming that TensorRT is intended to use the Tensor Cores due to their low power consumption and high performance, but I may be wrong.

Thanks for your help.


This is decided by the GPU scheduler.
If a Tensor Core is idle and the task is supported, it is expected to use the Tensor Core.

You can use a profiler to get the usage information.
Please check the command below for more information: