Constant GPU inference power

Hello,

I am trying to build a power profile for each of the Jetson Nano components independently. Most of the components' instantaneous power varies with the load (e.g., CPU power consumption depends on its % usage).

When modeling the GPU, however, I found that the actual power consumption reported by the tegrastats utility while running inference is constant at its maximum. For inference I use the TensorRT framework. I tried both full models about 120 layers deep and a single-layer model, and the reported power is the same. The energy changes with the inference time, but one would not expect such different loads to need the same number of cores; I may be wrong, though. Is this behavior expected?
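For reference, the GPU rail power can be sampled from tegrastats roughly like this (a sketch only; the POM_5V_GPU rail name is assumed for the Nano and differs on other boards and L4T releases):

# Sample the instantaneous GPU rail power (mW) printed by tegrastats.
# Assumptions: tegrastats is on PATH and prints a "POM_5V_GPU <cur>/<avg>" field;
# adjust the rail name for other boards.
import re
import subprocess

def sample_gpu_power_mw(interval_ms=500, n_samples=20, rail="POM_5V_GPU"):
    pattern = re.compile(rf"{rail} (\d+)/(\d+)")
    proc = subprocess.Popen(["tegrastats", "--interval", str(interval_ms)],
                            stdout=subprocess.PIPE, text=True)
    readings = []
    try:
        for line in proc.stdout:
            match = pattern.search(line)
            if match:
                readings.append(int(match.group(1)))   # current draw in mW
            if len(readings) >= n_samples:
                break
    finally:
        proc.terminate()
    return readings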

Thank you in advance

Best regards
Ignacio

Hi,
Probably the loading is heavy in both cases. Please try this sample and check if you see a difference:

/usr/src/jetson_multimedia_api/samples/02_video_dec_cuda

Jetson Linux API Reference: 02_video_dec_cuda
The sample contains simple CUDA code that generates only a light GPU load. Please run it as a comparison.
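A typical way to run it (assuming the sample has been built with make and the bundled H.264 clip under /usr/src/jetson_multimedia_api/data is still present) is something like:

./video_dec_cuda /usr/src/jetson_multimedia_api/data/Video/sample_outdoor_car_1080p_10fps.h264 H264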

Hello,

Thanks for the response.

You were right: this decoding pipeline only partially occupies the GPU. Can I then assume that inference with any model size under the TensorRT framework will occupy 100% of the GPU? By the way, I see this behavior on three boards: Nano, AGX and NX.

For a single-layer model it might make sense to need the whole GPU on the Nano, but on the AGX it seems like overkill.

Bests
Ignacio

Hi,
You may check whether it is specific to the model. We have a ResNet10 model in the DeepStream SDK. You can try this command on the Jetson platforms for comparison:

/opt/nvidia/deepstream/deepstream-5.1/samples/configs/deepstream-app$ deepstream-app -c source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt

Hi,

You are right. I can now see non-static GPU usage.

Do you have any idea why it may be using the full GPU with a model made of a single ReLU layer? The input is 224x224x3. The same happens with a MobileNetV1 model with the same input.

Also, I am reusing the Python TensorRT examples provided as a base.

Thanks

Hi,

Do you use the pure TensorRT Python API or the version integrated into TensorFlow or PyTorch?

With the pure TensorRT API, all inference jobs are submitted through the enqueue function.
If the queue is never empty, the GPU keeps running inference and the load stays at its maximum.
But if you submit input data periodically, e.g. every 33 ms, the GPU load will drop during the periods when the queue is empty.
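As a rough sketch with the TensorRT Python API (assuming an execution context, device-pointer bindings and a pycuda stream are already set up elsewhere; the names below are placeholders):

# Submit one inference every ~33 ms instead of back-to-back, so the GPU
# idles between requests and tegrastats should report a lower load.
# Assumes `context` is a trt.IExecutionContext, `bindings` a list of device
# pointers and `stream` a pycuda.driver.Stream.
import time

def run_periodically(context, bindings, stream, period_s=0.033, n_frames=300):
    for _ in range(n_frames):
        start = time.time()
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        stream.synchronize()                  # wait for this inference to finish
        spare = period_s - (time.time() - start)
        if spare > 0:
            time.sleep(spare)                 # GPU sits idle during this sleep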

Thanks.

Hello,

I use TensorRT python API.

I will try it with the Queue system.

On the other hand, I have a question regarding GPU core usage: is there a way of knowing/checking which cores are being used at any moment in time? I am assuming that TensorRT is intended to use Tensor Cores because of their low power consumption and high performance, but I may be wrong.

Thanks for your help.

Hi,

This is decided by the GPU scheduler.
If the Tensor Cores are idle and the task is supported, it is expected to use them.

You can use a profiler to get the usage information.
Please check the command below for more information:
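For example, assuming nvprof from the CUDA toolkit is available on the board (the script name here is just a placeholder), Tensor Core utilization per kernel can be reported with:

sudo /usr/local/cuda/bin/nvprof --metrics tensor_precision_fu_utilization python3 my_trt_inference.py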

Thanks.

Hello,

Thank you for your help.

It has taken me a while to understand the tool and learn to interpret the results. It is still curious to me that when I run a profile like that, only the ReLU operations seem to show a level above 0.

Is there a way of identifying which of the kernels listed in the log are actually using Tensor Cores? I haven't found any documentation on the kernel names, so I am a bit lost here.

On the other hand, I have also been trying to use the system-profiling flag while profiling my application, but I keep getting a warning saying that the underlying platform is not compatible. Aren't the Jetson boards compatible with this option? If they are not, is there another way I can relate the power consumption to the profiler results?

Thank you in advance.
Ignacio

Profile sample:
"Xavier (0)","void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=6, int=5, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)",5,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"
"Xavier (0)","void CUTENSOR_NAMESPACE::tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t, __half, float, __half, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=1, int=32 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=256 const **, cutensorOperator_t, void const *, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )",3,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"
"Xavier (0)","void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=7, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)",5,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"
"Xavier (0)","void CUTENSOR_NAMESPACE::vectorized_tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t, __half, float, float, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=1, int=32 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=256 const **, cutensorOperator_t, void const *, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )",18,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"
"Xavier (0)","void cuInt8::nchwToNchhw2<__half>(__half const , __half, int, int, int, int, int, int, cuInt8::ReducedDivisorParameters)",6,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"

Hi,

Sorry, I didn't notice that you are using the Nano.
Please note that Tensor Cores are only available on GPU architectures 7.x and later.

Currently, among the Jetson platforms, only Xavier and Xavier NX have Tensor Cores.
https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix
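As a quick sanity check of what a given board reports, the compute capability can be queried from Python (a small sketch using pycuda; device 0 is assumed to be the only GPU):

# Tensor Cores require compute capability 7.x or later (e.g. Xavier is 7.2).
import pycuda.driver as cuda

cuda.init()
major, minor = cuda.Device(0).compute_capability()
print(f"Compute capability {major}.{minor}; Tensor Cores available: {major >= 7}")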

Thanks.

Hello,

Thanks for letting me know; I wasn't sure whether the Nano boards were equipped with Tensor Cores.

Anyway, I am performing the tests on three boards at the same time: Nano, AGX and NX. I would still like to know how I can identify the kernels being used. For example, when I run a single convolution layer on the AGX, the kernel that computes the operation is called something like void cuInt8::nchwToNchhw2, and I don't have a clue what it means. Isn't there a reference where I can check what the kernels do?

Also, the --system-profiling option does not seem to be available on these devices; is that right? I would very much like to see the power consumption in parallel with the performance.

Let me attach an example of the mentioned convolution profile.
visual_profile_conv.nvvp (8.8 MB)

Thank you in advance.
Ignacio

Hi,

cuInt8::nchwToNchhw2 is one kind of format transform.
TensorRT converts the data from the NCHW layout to the NCHHW2 layout for INT8 operations.
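As an illustration of what such a channel-packed layout looks like (assuming NCHHW2 here corresponds to TensorRT's two-wide channel-vectorized packing, where pairs of channels become the innermost dimension; that correspondence is an assumption), here is a small numpy sketch:

# Repack an NCHW tensor into an NC/2HW2-style layout: element (n, c, h, w)
# moves to [n][c // 2][h][w][c % 2], with channels padded to an even count.
import numpy as np

def nchw_to_chw2(x):
    n, c, h, w = x.shape
    if c % 2:
        x = np.concatenate([x, np.zeros((n, 1, h, w), x.dtype)], axis=1)
        c += 1
    return x.reshape(n, c // 2, 2, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 3 * 2 * 2, dtype=np.float32).reshape(2, 3, 2, 2)
print(nchw_to_chw2(x).shape)   # (2, 2, 2, 2, 2)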

Thanks.