Constant GPU inference power

Hello,

I am trying to build a power profile for each of the Jetson Nano components independently. Most of the components' instantaneous power varies with the load (e.g., CPU power consumption depends on its % usage).

When modeling the GPU, however, I found that the actual power consumption reported by the tegrastats utility while running inference is constant at its maximum. For inference I use the TensorRT framework. I tried both full models about 120 layers deep and a single-layer model, and the reported power is the same. The energy changes with the inference time, but one would not expect such different loads to need the same number of cores; I may be wrong, though. Is this behavior expected?
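For reference, the GPU rail power can be sampled from tegrastats roughly like this (a sketch only; the POM_5V_GPU rail name is assumed for the Nano and differs on other boards and L4T releases):

# Sample the instantaneous GPU rail power (mW) printed by tegrastats.
# Assumptions: tegrastats is on PATH and prints a "POM_5V_GPU <cur>/<avg>" field;
# adjust the rail name for other boards.
import re
import subprocess

def sample_gpu_power_mw(interval_ms=500, n_samples=20, rail="POM_5V_GPU"):
    pattern = re.compile(rf"{rail} (\d+)/(\d+)")
    proc = subprocess.Popen(["tegrastats", "--interval", str(interval_ms)],
                            stdout=subprocess.PIPE, text=True)
    readings = []
    try:
        for line in proc.stdout:
            match = pattern.search(line)
            if match:
                readings.append(int(match.group(1)))   # current draw in mW
            if len(readings) >= n_samples:
                break
    finally:
        proc.terminate()
    return readings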

Thank you in advance

Best regards
Ignacio

Hi,
Probably the loading is heavy in both cases. Please try this sample and check if you see a difference:

/usr/src/jetson_multimedia_api/samples/02_video_dec_cuda

Jetson Linux API Reference: 02_video_dec_cuda
The sample contains simple CUDA code that generates only a light GPU load. Please run it as a comparison.
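A typical way to run it (assuming the sample has been built with make and the bundled H.264 clip under /usr/src/jetson_multimedia_api/data is still present) is something like:

./video_dec_cuda /usr/src/jetson_multimedia_api/data/Video/sample_outdoor_car_1080p_10fps.h264 H264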

Hello,

Thanks for the response.

You were right: this decoding pipeline only partially occupies the GPU. Can I then assume that inference with any model size under the TensorRT framework will occupy 100% of the GPU? By the way, I see this behavior on three boards: Nano, AGX and NX.

For a single-layer model it might make sense to need the whole GPU on the Nano, but on the AGX it seems like overkill.

Bests
Ignacio

Hi,
You may check whether it is specific to the model. We have a ResNet10 model in the DeepStream SDK. You can try this command on the Jetson platforms for comparison:

/opt/nvidia/deepstream/deepstream-5.1/samples/configs/deepstream-app$ deepstream-app -c source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt

Hi,

You are right. I can now see non-static GPU usage.

Do you have any idea why it may be using the full GPU with a model made of a single ReLU layer? The input is 224x224x3. The same happens with a MobileNetV1 model with the same input.

Also, I am reusing the Python TensorRT examples provided as a base.

Thanks

Hi,

Do you use the pure TensorRT Python API or the version integrated into TensorFlow or PyTorch?

With the pure TensorRT API, all inference jobs are submitted through the enqueue function.
If the queue is never empty, the GPU keeps running inference and the load stays at its maximum.
But if you submit input data periodically, e.g. every 33 ms, the GPU load will drop during the periods when the queue is empty.
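As a rough sketch with the TensorRT Python API (assuming an execution context, device-pointer bindings and a pycuda stream are already set up elsewhere; the names below are placeholders):

# Submit one inference every ~33 ms instead of back-to-back, so the GPU
# idles between requests and tegrastats should report a lower load.
# Assumes `context` is a trt.IExecutionContext, `bindings` a list of device
# pointers and `stream` a pycuda.driver.Stream.
import time

def run_periodically(context, bindings, stream, period_s=0.033, n_frames=300):
    for _ in range(n_frames):
        start = time.time()
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        stream.synchronize()                  # wait for this inference to finish
        spare = period_s - (time.time() - start)
        if spare > 0:
            time.sleep(spare)                 # GPU sits idle during this sleep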

Thanks.

Hello,

I use TensorRT python API.

I will try it with the Queue system.

On the other hand, I have a question regarding GPU core usage: is there a way of knowing/checking which cores are being used at any moment in time? I am assuming that TensorRT is intended to use Tensor Cores because of their low power consumption and high performance, but I may be wrong.

Thanks for your help.

Hi,

This is decided by the GPU scheduler.
If the Tensor Cores are idle and the task is supported, it is expected to use them.

You can use a profiler to get the usage information.
Please check the command below for more information:
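For example, assuming nvprof from the CUDA toolkit is available on the board (the script name here is just a placeholder), Tensor Core utilization per kernel can be reported with:

sudo /usr/local/cuda/bin/nvprof --metrics tensor_precision_fu_utilization python3 my_trt_inference.py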

Thanks.

Hello,

Thank you for your help.

It has taken me a while to understand the tool and learn to interpret the results. It is still curious to me that when I run a profile like that, only the ReLU operations seem to show a level above 0.

Is there a way of identifying which of the kernels listed in the log are actually using Tensor Cores? I haven't found any documentation on the kernel names, so I am a bit lost here.

On the other hand, I have also been trying to use the system-profiling flag while profiling my application, but I keep getting a warning saying that the underlying platform is not compatible. Aren't the Jetson boards compatible with this option? If they are not, is there another way I can relate the power consumption to the profiler results?

Thank you in advance.
Ignacio

Profile sample:
"Xavier (0)","void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=6, int=5, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)",5,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"
"Xavier (0)","void CUTENSOR_NAMESPACE::tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t, __half, float, __half, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=1, int=32 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=256 const **, cutensorOperator_t, void const *, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )",3,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"
"Xavier (0)","void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=7, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)",5,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"
"Xavier (0)","void CUTENSOR_NAMESPACE::vectorized_tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t, __half, float, float, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=1, int=32 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=256 const **, cutensorOperator_t, void const *, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )",18,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"
"Xavier (0)","void cuInt8::nchwToNchhw2<__half>(__half const , __half, int, int, int, int, int, int, cuInt8::ReducedDivisorParameters)",6,"tensor_precision_fu_utilization","Tensor-Precision Function Unit Utilization","Idle (0)","Idle (0)","Idle (0)"

Hi,

Sorry, I didn't notice that you are using the Nano.
Please note that Tensor Cores are only available on GPU architectures 7.x and later.

Currently, among the Jetson platforms, only Xavier and Xavier NX have Tensor Cores.
https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix
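As a quick sanity check of what a given board reports, the compute capability can be queried from Python (a small sketch using pycuda; device 0 is assumed to be the only GPU):

# Tensor Cores require compute capability 7.x or later (e.g. Xavier is 7.2).
import pycuda.driver as cuda

cuda.init()
major, minor = cuda.Device(0).compute_capability()
print(f"Compute capability {major}.{minor}; Tensor Cores available: {major >= 7}")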

Thanks.

Hello,

Thanks for letting me know; I wasn't sure whether the Nano boards were equipped with Tensor Cores.

Anyway, I am performing the tests on three boards at the same time: Nano, AGX and NX. I would still like to know how I can identify the kernels being used. For example, when I run a single convolution layer on the AGX, the kernel that computes the operation is called something like void cuInt8::nchwToNchhw2, and I don't have a clue what it means. Isn't there a reference where I can check what the kernels do?

Also, the --system-profiling option does not seem to be available on these devices; is that right? I would very much like to see the power consumption in parallel with the performance.

Let me attach an example of the mentioned convolution profile.
visual_profile_conv.nvvp (8.8 MB)

Thank you in advance.
Ignacio

Hi,

cuInt8::nchwToNchhw2 is one kind of format transform.
TensorRT converts the data from the NCHW layout to the NCHHW2 layout for INT8 operations.
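As an illustration of what such a channel-packed layout looks like (assuming NCHHW2 here corresponds to TensorRT's two-wide channel-vectorized packing, where pairs of channels become the innermost dimension; that correspondence is an assumption), here is a small numpy sketch:

# Repack an NCHW tensor into an NC/2HW2-style layout: element (n, c, h, w)
# moves to [n][c // 2][h][w][c % 2], with channels padded to an even count.
import numpy as np

def nchw_to_chw2(x):
    n, c, h, w = x.shape
    if c % 2:
        x = np.concatenate([x, np.zeros((n, 1, h, w), x.dtype)], axis=1)
        c += 1
    return x.reshape(n, c // 2, 2, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 3 * 2 * 2, dtype=np.float32).reshape(2, 3, 2, 2)
print(nchw_to_chw2(x).shape)   # (2, 2, 2, 2, 2)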

Thanks.