CUDA HW idle time

soohyung.zhang · November 6, 2022, 9:30am

After profiling an inference system written with cuDNN, I see the CUDA HW status shown below.

I simply repeated convolution, ReLU, and memset using cuDNN. I’m curious why the GPU has idle time.

In some sections, as shown above, the GPU works non-stop. I would like to know what causes the difference.

AastaLLL · November 7, 2022, 4:23am

Hi,

First, please remember to maximize the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Could you share how you profile the cuDNN inference performance?

If you use real image data as input, please check the image decoder/pre-processing performance as well.
If the preprocessing is not fast enough, the GPU may have to wait for the data and sit idle in the meantime.

Thanks.

soohyung.zhang · November 7, 2022, 4:55am

@AastaLLL Thank you for your answer. In my opinion, a preprocessing or device-environment problem would not explain why the idle time occurs only in some sections. I profiled several times, and the idle time always appears in the same specific section. I initially thought that a characteristic of the kernel selected by cuDNN was causing the idle time. However, when I checked, even the same kernel is sometimes preceded by idle time and sometimes not. For reference, when “CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM” is selected as the convolution algorithm, the GPU has no idle time, as shown in the figure below. The kernel called in that case is “implicit_convolve_sgemm”.
I am collecting the logs on Orin with “nsys profile” and visualizing them on the host PC.
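To quantify the gaps rather than read them off the timeline, consecutive kernel intervals can be compared directly. The following is a minimal C++ sketch, assuming the kernel start and end timestamps (in nanoseconds) have already been exported from the profiler report and belong to a single stream. The `KernelSpan` type and the `idle_gaps` function are hypothetical names introduced for illustration; they are not part of Nsight Systems or cuDNN.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one kernel execution on the GPU timeline.
struct KernelSpan {
    std::int64_t start_ns;  // kernel start timestamp
    std::int64_t end_ns;    // kernel end timestamp
};

// Returns the idle time before each kernel except the first.
// Assumes the spans are sorted by start time and come from one stream.
// Overlapping spans are reported as a gap of zero.
std::vector<std::int64_t> idle_gaps(const std::vector<KernelSpan>& spans) {
    std::vector<std::int64_t> gaps;
    for (std::size_t i = 1; i < spans.size(); ++i) {
        const std::int64_t gap = spans[i].start_ns - spans[i - 1].end_ns;
        gaps.push_back(std::max<std::int64_t>(0, gap));
    }
    return gaps;
}
```

Comparing the gap list of a run that uses one convolution algorithm against a run that uses another shows whether the idle time is tied to specific kernels or to specific positions in the launch sequence.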

AastaLLL · November 8, 2022, 3:04am

Hi,

Could you share the reproducible source and steps so we can check it in our environment as well?
Thanks.

soohyung.zhang · November 9, 2022, 1:29pm

Makefile (13.1 KB)
fp32_conv.cu (5.5 KB)
Please understand that I cannot upload our company’s source code for security reasons.
Instead, I wrote sample code for a convolution test. Please find the files attached.
When CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM is selected as the convolution algorithm, there is no idle time between kernels, as shown in the figure below.

However, intermittent idle time occurs when CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM is selected.

I would like to know the root cause of this idle time, which occurs even though there are no other commands between consecutive calls to the cuDNN API.

fp32_test6.nsys-rep (1.9 MB)
fp32_test5.nsys-rep (1.9 MB)

A profiling file containing the results shown in the figure above is also attached.

soohyung.zhang · November 14, 2022, 12:52pm

@AastaLLL
Can I expect technical support from NVIDIA on this matter?
Thank you

AastaLLL · November 15, 2022, 5:25am

Hi,

Thanks for your patience.

We are trying to reproduce this issue internally.
One possible reason is that there are not enough work queues.

Could you try increasing the number of queues and test it again?

export CUDA_DEVICE_MAX_CONNECTIONS=32

More information about the kernel work queue can be found in the below doc:
https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_4
Thanks.
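If exporting the variable in the shell is inconvenient, the same setting can be applied from inside the program. The sketch below is one way to do that on Linux, assuming the variable is read when the CUDA driver initializes, so the call has to happen before the first CUDA or cuDNN call in the process. The helper name `set_max_connections` is hypothetical; the 1 to 32 range check follows the documented limits of `CUDA_DEVICE_MAX_CONNECTIONS`.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: sets CUDA_DEVICE_MAX_CONNECTIONS for this process.
// Assumption: must be called before any CUDA/cuDNN call, because the
// variable is expected to be read at driver initialization.
// Returns false if n is outside 1..32 or if setenv fails.
bool set_max_connections(int n) {
    if (n < 1 || n > 32) {
        return false;
    }
    const std::string value = std::to_string(n);
    // POSIX setenv; the last argument (1) overwrites an existing value.
    return setenv("CUDA_DEVICE_MAX_CONNECTIONS", value.c_str(), 1) == 0;
}
```

Calling `set_max_connections(32)` at the very top of `main` is intended to have the same effect as the `export` line above.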

soohyung.zhang · November 15, 2022, 9:34am

Thanks for the guide. But the gpu idle time is still showing up.

soohyung.zhang · November 15, 2022, 11:00am

@AastaLLL
As you can see in the code attached above, only cuDNN’s convolution forward function is called repeatedly, yet the GPU idle time still occurs, and I would like to know why. I don’t think there is a more primitive way to run a convolution on the GPU than this. As far as I know, cuDNN is lower level than TensorRT. I’d like to find a way to repeat the convolution operation a bit more efficiently.

AastaLLL · November 18, 2022, 3:01am

Hi,

Thanks for your testing.

We are checking this issue internally.
Will share more information with you later.

Thanks.

soohyung.zhang · November 18, 2022, 9:38am

@AastaLLL
Thanks for reviewing this issue. I am waiting for your reply.

AastaLLL · November 21, 2022, 8:56am

Hi,

Thanks for your patience.

Is this issue captured in the Nsight Systems file shared on Nov 9?
We opened the file with v2022.3.3 and the output looks different from yours.

Please let us know if anything is missing in our setup.
Thanks.

soohyung.zhang · November 21, 2022, 9:02am

@AastaLLL

It was captured by Version: 2022.4.1.21-0db2c85 Windows-x64.
Thank you

AastaLLL · November 24, 2022, 7:34am

Hi,

We still cannot see the expected output with Nsight Systems 2022.4.1 on Windows.
Could you double-check that the files can be opened in your environment, or attach new ones?

Thanks.

soohyung.zhang · December 1, 2022, 3:43pm

@AastaLLL
I confirmed that the files also open normally in my colleague’s development environment. Is it difficult to profile directly with the source code shared above? If you have any difficulty using the code I shared, I will update it. The profile files I shared were also captured using that source code.
Thank you

AastaLLL · December 2, 2022, 6:09am

Hi,

Thanks for the confirmation.
Not sure if something is missing in my environment. Will double-check it again.

Thanks.

AastaLLL · December 28, 2022, 9:10am

Hi,

Sorry for the late update.

Back to cuDNN performance, below are the descriptions of the two chosen algorithms:

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM

This algorithm expresses the convolution as a matrix product without actually explicitly forming the matrix that holds the input tensor data.

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

This algorithm expresses convolution as a matrix product without actually explicitly forming the matrix that holds the input tensor data, but still needs some memory workspace to precompute some indices in order to facilitate the implicit construction of the matrix that holds the input tensor data.
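The difference between the two descriptions can be illustrated on the CPU. The sketch below is not cuDNN's implementation; it is a simplified single-image, stride-1, no-padding convolution written only to show the idea. The first function decodes the reduction index on the fly, and the second reads offsets from a table that is computed once and kept in a workspace. All names (`ConvShape`, `conv_implicit_gemm`, `precompute_offsets`, `conv_precomp_gemm`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Layouts: input CHW, filter KCRS, output KPQ. Stride 1, no padding.
struct ConvShape {
    int C, H, W;  // input channels, height, width
    int K, R, S;  // number of filters, filter height, filter width
    int P() const { return H - R + 1; }  // output height
    int Q() const { return W - S + 1; }  // output width
};

// Implicit GEMM: the (C*R*S) x (P*Q) input matrix is never materialized.
// The (c, r, t) position is decoded from the reduction index i each time.
std::vector<float> conv_implicit_gemm(const ConvShape& s,
                                      const std::vector<float>& in,
                                      const std::vector<float>& flt) {
    const int P = s.P(), Q = s.Q(), crs = s.C * s.R * s.S;
    std::vector<float> out(static_cast<std::size_t>(s.K) * P * Q, 0.0f);
    for (int k = 0; k < s.K; ++k)
        for (int p = 0; p < P; ++p)
            for (int q = 0; q < Q; ++q) {
                float acc = 0.0f;
                for (int i = 0; i < crs; ++i) {
                    const int c = i / (s.R * s.S);
                    const int r = (i / s.S) % s.R;
                    const int t = i % s.S;
                    acc += flt[k * crs + i] *
                           in[(c * s.H + p + r) * s.W + q + t];
                }
                out[(k * P + p) * Q + q] = acc;
            }
    return out;
}

// Workspace contents: one input offset per reduction index, computed once.
std::vector<int> precompute_offsets(const ConvShape& s) {
    const int crs = s.C * s.R * s.S;
    std::vector<int> off(crs);
    for (int i = 0; i < crs; ++i) {
        const int c = i / (s.R * s.S);
        const int r = (i / s.S) % s.R;
        const int t = i % s.S;
        off[i] = (c * s.H + r) * s.W + t;
    }
    return off;
}

// Precomputed-index variant: same arithmetic, but the inner loop only
// adds a table entry to the base offset of the output position.
std::vector<float> conv_precomp_gemm(const ConvShape& s,
                                     const std::vector<float>& in,
                                     const std::vector<float>& flt,
                                     const std::vector<int>& off) {
    const int P = s.P(), Q = s.Q(), crs = s.C * s.R * s.S;
    std::vector<float> out(static_cast<std::size_t>(s.K) * P * Q, 0.0f);
    for (int k = 0; k < s.K; ++k)
        for (int p = 0; p < P; ++p)
            for (int q = 0; q < Q; ++q) {
                const int base = p * s.W + q;
                float acc = 0.0f;
                for (int i = 0; i < crs; ++i)
                    acc += flt[k * crs + i] * in[base + off[i]];
                out[(k * P + p) * Q + q] = acc;
            }
    return out;
}
```

In this sketch the offset table depends only on the tensor shapes, so it can be built once and reused across calls. Whether cuDNN rebuilds its indices on every call is not stated in the description quoted above.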

It looks like the PRECOMP version requires some index computation.
Do you see any other streams or processes working during the GPU idle period?
If the kernel depends on data that is not ready yet, the GPU will need to wait for its input.

Thanks.

soohyung.zhang · January 2, 2023, 1:30am

@AastaLLL Thank you for your answer. I’d like to ask a few more questions.

Does the index calculation in CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM have to be redone on every inference even if the network does not change?
Is there any way to get rid of the GPU idle time while using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM? If the index calculation is done on the CPU, can’t that time be hidden?
With inference using TensorRT, there is no GPU idle time in the profiling result. Maybe TensorRT isn’t using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM at all?

Thanks again.

AastaLLL · January 4, 2023, 10:10am

Hi,

The idle time should be device-dependent.
TensorRT evaluates the candidate algorithms and chooses the fastest one.

Do you have any other applications running at the same time?
Knowing this will help us find what the GPU is waiting for.

Thanks.

soohyung.zhang · January 6, 2023, 5:13am

Thank you for your patience in responding to this issue.
