CUDA HW idle time

Profiling an inference system written with cuDNN shows the CUDA HW status as follows.

The test simply repeats convolution, ReLU, and memset using cuDNN. I’m curious why the GPU has idle time.

In some sections, as shown above, the GPU works non-stop. I would like to know what causes this difference.

Hi,

First, please remember to maximize the device performance with the following command:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Could you share how you profile the cuDNN inference performance?

If you use real image data as input, please check the image decoder/pre-processing performance as well.
If the preprocessing is not fast enough, the GPU might have to wait for the data, which shows up as idle time.

Thanks.

@AastaLLL Thank you for the answer. In my opinion, a preprocessing or device environment problem would not explain why the idle time occurs only in some sections. I profiled several times, and the idle time always occurs in the same specific section.

I initially thought that a characteristic of the kernel function selected by cuDNN was causing the idle time. However, when I checked, even the same kernel function sometimes has idle time before it executes and sometimes does not.

For reference, when “CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM” is selected as the convolution algorithm, the GPU has no idle time, as shown in the figure below. The kernel function called in this case is “implicit_convolve_sgemm”.

I am collecting logs on the Orin with “nsys profile” and visualizing them on the host PC.
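One way to make the idle periods measurable rather than visual is to compute the gaps between consecutive kernels from their start and end timestamps. The sketch below is a minimal Python example under stated assumptions: the `KernelRecord` type, the `idle_gaps` helper, the 50 ns threshold, and the sample timings are all hypothetical and are not taken from the attached reports. Real values would first have to be exported from the Nsight Systems report.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KernelRecord:
    """One kernel execution on the GPU timeline (hypothetical record layout)."""
    name: str
    start_ns: int
    end_ns: int


def idle_gaps(kernels: List[KernelRecord], min_gap_ns: int = 0) -> List[Tuple[str, int]]:
    """Return (name of the kernel that follows the gap, gap length in ns).

    Kernels are sorted by start time. A gap is measured from the latest end
    time seen so far to the next kernel's start, so overlapping kernels
    never produce a negative gap.
    """
    gaps: List[Tuple[str, int]] = []
    ordered = sorted(kernels, key=lambda k: k.start_ns)
    if not ordered:
        return gaps
    busy_until = ordered[0].end_ns
    for k in ordered[1:]:
        gap = k.start_ns - busy_until
        if gap > min_gap_ns:
            gaps.append((k.name, gap))
        busy_until = max(busy_until, k.end_ns)
    return gaps


# Made-up timings for illustration only, not taken from the attached reports.
sample = [
    KernelRecord("conv_kernel", 0, 100),
    KernelRecord("conv_kernel", 105, 205),
    KernelRecord("conv_kernel", 900, 1000),  # long idle period before this launch
]

for name, gap in idle_gaps(sample, min_gap_ns=50):
    print(f"{gap} ns idle before {name}")
```

Keeping the name of the kernel that follows each gap makes it easy to see whether the same kernel sometimes starts immediately and sometimes after a long wait.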

Hi,

Could you share the reproducible source and steps so we can check it in our environment as well?
Thanks.

Makefile (13.1 KB)
fp32_conv.cu (5.5 KB)
Please understand that I cannot upload our company’s source code for security reasons.
Instead, I wrote sample code for a convolution test. Please find the attached files.
When CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM is selected as the algorithm for the convolution operation, there is no idle time between kernel functions, as shown in the figure below.

However, intermittent idle time occurs when CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM is selected.

I would like to know the root cause of idle time like this, which occurs even though there are no other commands between consecutive calls to the cuDNN API.

fp32_test6.nsys-rep (1.9 MB)
fp32_test5.nsys-rep (1.9 MB)

The profiling files containing the results shown in the figures above are also attached.

@AastaLLL
Can I expect technical support from NVIDIA on this matter?
Thank you

Hi,

Thanks for your patience.

We are trying to reproduce this issue internally.
One possible cause is that there are not enough work queues.

Could you try increasing the number of queues and test again?

export CUDA_DEVICE_MAX_CONNECTIONS=32

More information about the kernel work queue can be found in the doc below:
https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_4
Thanks.
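As far as I understand, CUDA_DEVICE_MAX_CONNECTIONS is read when the process initializes CUDA, so it has to be present in the environment of the process that actually runs the test. A minimal sketch of launching the test with the variable set explicitly follows; the helper names `env_with_max_connections` and `run_test` are assumptions for illustration, and the 1 to 32 range check reflects the documented limits as I understand them.

```python
import os
import subprocess
from typing import Dict, Optional


def env_with_max_connections(connections: int,
                             base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with CUDA_DEVICE_MAX_CONNECTIONS set."""
    # Assumed valid range (1 to 32); treat this check as an assumption.
    if not 1 <= connections <= 32:
        raise ValueError("connections must be between 1 and 32")
    env = dict(os.environ if base is None else base)
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = str(connections)
    return env


def run_test(binary: str, connections: int) -> int:
    """Launch the test binary with the variable set and return its exit code."""
    result = subprocess.run([binary], env=env_with_max_connections(connections))
    return result.returncode


if __name__ == "__main__":
    # Show the value that a child process would see.
    print(env_with_max_connections(32, base={})["CUDA_DEVICE_MAX_CONNECTIONS"])
```

For example, `run_test("./fp32_conv", 32)` would launch a binary built from the attached fp32_conv.cu, where the path `./fp32_conv` is hypothetical.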

Thanks for the guidance, but the GPU idle time still shows up.

@AastaLLL
As you can see from the code attached above, I would like to know why GPU idle time occurs even though only cuDNN’s convolution forward function is called repeatedly. I don’t think there is a more primitive way to run convolution on the GPU than this. As far as I know, cuDNN is lower level than TensorRT. I’d like to find a way to repeat the convolution operation more efficiently.

Hi,

Thanks for your testing.

We are checking this issue internally.
Will share more information with you later.

Thanks.

@AastaLLL
Thanks for reviewing this issue. I am waiting for your reply.

Hi,

Thanks for your patience.

Is this issue captured in the Nsight Systems file shared on Nov 9?
We opened the file with v2022.3.3, and the output looks different from yours.

Please let us know if anything is missing in our settings.
Thanks.

@AastaLLL

It was captured with version 2022.4.1.21-0db2c85, Windows-x64.
Thank you

Hi,

Somehow we still cannot see the expected output with Nsight Systems 2022.4.1 on Windows.
Could you double-check that the files can be opened in your environment, or attach new ones?

Thanks.

@AastaLLL
I confirmed that the files also open normally in my colleague’s development environment. Is it difficult to profile directly using the source code shared above? If you have any difficulty using the code I shared, I will update it. The profile files I shared were also captured using that source code.
Thank you

Hi,

Thanks for the confirmation.
Not sure if something is missing in my environment. Will double-check it again.

Thanks.

Hi,

Sorry for the late update.

Back to cuDNN performance, below are the descriptions of the two algorithms in question:

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM

This algorithm expresses the convolution as a matrix product without actually explicitly forming the matrix that holds the input tensor data.

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

This algorithm expresses convolution as a matrix product without actually explicitly forming the matrix that holds the input tensor data, but still needs some memory workspace to precompute some indices in order to facilitate the implicit construction of the matrix that holds the input tensor data.

It looks like the PRECOMP version requires some index computation.
Do you see any other streams or processes working during the GPU idle period?
If there are data dependencies, the GPU will need to wait for the input.

Thanks.
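The question about other streams can be checked mechanically: given an idle window on the stream that runs the convolutions, list any activity elsewhere that overlaps it. The following is a minimal sketch; the `Activity` type, the `overlapping` helper, the stream labels, and the timings are all hypothetical and only illustrate the interval test.

```python
from typing import List, NamedTuple


class Activity(NamedTuple):
    """A timed event on some stream or process (hypothetical layout)."""
    source: str
    start_ns: int
    end_ns: int


def overlapping(window_start: int, window_end: int,
                activities: List[Activity]) -> List[Activity]:
    """Return activities that share any time with [window_start, window_end)."""
    return [a for a in activities
            if a.start_ns < window_end and a.end_ns > window_start]


# Made-up events for illustration; the idle window runs from 205 ns to 900 ns.
events = [
    Activity("stream 14 memcpy", 300, 600),
    Activity("stream 7 kernel", 0, 100),
]

for a in overlapping(205, 900, events):
    print(a.source)
```

If nothing overlaps the idle window, that would point away from contention with other GPU work and toward something on the host side, though this sketch alone cannot confirm the cause.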

@AastaLLL Thank you for the answer. I’d like to ask a few more questions.

  1. Does the index calculation in CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM have to be repeated on every inference even if the network does not change?

  2. Is there any way to get rid of the GPU idle time while using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM? If the index calculation is being done on the CPU, can’t we hide that time?

  3. With inference using TensorRT, there is no GPU idle time in the profiling result. Maybe TensorRT isn’t using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM at all?

Thanks again.

Hi,

The idle time should be device-dependent.
TensorRT evaluates all the possible algorithms and chooses the fastest one.

Do you have any other applications running at the same time?
Knowing this will help us find out what the GPU is waiting for.

Thanks.

Thank you for your patience in responding to this issue.