cuDNN convolution performance on TX2: 12.5% GPU occupancy

Hi Guys,

I'm seeing a cuDNN performance issue on TX2. Here is the background:

My application is a computer-vision app built on TensorFlow. After porting it to TX2 I did some profiling and found that about 75% of the time is spent in convolutions. From the kernel names I can see that tf.conv uses cuDNN, and about 50% of the total time goes to a single kernel named:

maxwell_scudnn_128x128_stridedB_splitK_small_nn

After comparing the kernel's performance in nvvp's profiling reports, I found the GPU utilization is only about 60% on TX2, while the same kernel reaches about 95% on a GTX 1080 Ti.

Looking into the details of the profile, the root cause is that the number of active blocks per SM is 2 on the 1080 Ti but only 1 on TX2, and the limiting factor is shared memory.

Quoting the report:

The kernel uses 34 KiB of shared memory for each block. This shared memory usage is likely preventing the kernel from fully
utilizing the GPU. Device “NVIDIA Tegra X2” is configured to have 64 KiB of shared memory for each SM. Because the kernel
uses 34 KiB of shared memory for each block each SM is limited to simultaneously executing 1 block (8 warps). Chart “Varying
Shared Memory Usage” below shows how changing shared memory usage will change the number of blocks that can execute on
each SM.
Optimization: Reduce shared memory usage to increase the number of blocks that can execute on each SM. You can also increase
the number of blocks that can execute on each SM by increasing the amount of shared memory available to your kernel. You do
this by setting the preferred cache configuration to “prefer shared”.
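
Just to make that arithmetic concrete (this is my own illustration, not from the report): a minimal CUDA sketch that asks the occupancy API how many blocks fit on one SM for a given dynamic shared-memory footprint. The 256-thread block (8 warps) and the 34 KiB figure come from the report above; the kernel itself is only a placeholder, not the cuDNN kernel.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the cuDNN convolution kernel.
// It exists only so the occupancy API has something to inspect.
__global__ void dummyConvKernel(float *out)
{
    extern __shared__ char smem[];
    smem[threadIdx.x] = (char)threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = smem[threadIdx.x];
}

int main()
{
    const int    blockSize    = 256;        // 8 warps per block, as in the report
    const size_t dynSmemBytes = 34 * 1024;  // 34 KiB per block, as in the report

    int maxBlocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSm, dummyConvKernel, blockSize, dynSmemBytes);

    // With 64 KiB of shared memory per SM (TX2), this should print 1;
    // a footprint of 32 KiB or less would allow 2 resident blocks.
    printf("Resident blocks per SM with %zu KiB shared memory: %d\n",
           dynSmemBytes / 1024, maxBlocksPerSm);
    return 0;
}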

But on the 1080 Ti, the limiting factor is register usage:

GPU Utilization Is Limited By Register Usage
The kernel uses 128 registers for each thread (32768 registers for each block). This register usage is likely preventing the kernel
from fully utilizing the GPU. Device “GeForce GTX 1080 Ti” provides up to 65536 registers for each block. Because the kernel
uses 32768 registers for each block each SM is limited to simultaneously executing 2 blocks (16 warps). Chart “Varying Register
Count” below shows how changing register usage will change the number of blocks that can execute on each SM.
Optimization: Use the -maxrregcount flag or the launch_bounds qualifier to decrease the number of registers used by each
thread. This will increase the number of blocks that can execute on each SM. On devices with Compute Capability 5.2 turning
global cache off can increase the occupancy limited by register usage.
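
For reference, this is what the two knobs the profiler suggests look like on a kernel you compile yourself; they cannot be applied to cuDNN's prebuilt kernels from application code. The 256 threads and the 2-blocks-per-SM target mirror the numbers above, and the kernel body is a placeholder.

// 1) __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) asks the
//    compiler to keep register usage low enough that at least 2 blocks of 256
//    threads can be resident on each SM.
__global__ void __launch_bounds__(256, 2)
myConvLikeKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;  // placeholder work
}

// 2) Alternatively, cap registers for the whole compilation unit at build time:
//    nvcc -maxrregcount=64 -o app app.cu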

So, could someone from the cuDNN team look into this? Since the SM has 64 KiB of shared memory, reducing the kernel's shared memory usage by just 2 KiB (34 KiB down to 32 KiB) would allow two resident blocks per SM instead of one, which could roughly double this kernel's performance on TX2.

Hi,

The performance issue may come from TensorFlow's implementation.
It's recommended to use our TensorRT engine instead of TensorFlow.

We optimize the inference implementation for the GPU architecture, so it can reach better performance.
Here is a tutorial on running a TensorFlow model with TensorRT:

Thanks.

Thanks for the reply.

Actually I can't use TRT, because my application is a bit more complicated than classification and requires ops like 3D convolution, which TRT does not support yet.
TensorFlow is just calling the cuDNN library here, so maybe someone can look at the cuDNN code and see whether reducing this kernel's shared memory usage would achieve better performance on TX2?
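
In the meantime, here is roughly how I plan to check this outside TensorFlow: a standalone cuDNN 6 sketch that enumerates and times the forward-convolution algorithms available for one layer shape, so I can see whether an algorithm that avoids the splitK kernel is competitive on TX2. All the tensor sizes below are placeholders, not the real shapes from my network.

#include <cstdio>
#include <cudnn.h>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudnnStatus_t s = (call);                                        \
        if (s != CUDNN_STATUS_SUCCESS) {                                 \
            printf("cuDNN error %s at line %d\n",                        \
                   cudnnGetErrorString(s), __LINE__);                    \
            return 1;                                                    \
        }                                                                \
    } while (0)

int main()
{
    cudnnHandle_t handle;
    CHECK(cudnnCreate(&handle));

    // Placeholder layer: NCHW float, batch 1, 64 -> 64 channels, 56x56 input, 3x3 filter.
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    CHECK(cudnnCreateTensorDescriptor(&xDesc));
    CHECK(cudnnCreateTensorDescriptor(&yDesc));
    CHECK(cudnnCreateFilterDescriptor(&wDesc));
    CHECK(cudnnCreateConvolutionDescriptor(&convDesc));

    CHECK(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                     1, 64, 56, 56));
    CHECK(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                     64, 64, 3, 3));
    CHECK(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                          CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

    int n, c, h, w;
    CHECK(cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w));
    CHECK(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w));

    // Benchmark every forward algorithm cuDNN knows for this shape.
    cudnnConvolutionFwdAlgoPerf_t perf[8];
    int returned = 0;
    CHECK(cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                               8, &returned, perf));
    for (int i = 0; i < returned; ++i)
        printf("algo %d: status %d, %.3f ms, workspace %zu bytes\n",
               (int)perf[i].algo, (int)perf[i].status, perf[i].time, perf[i].memory);

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}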

Hi,

The TensorFlow kernels are implemented with desktop GPUs in mind.
Their selection of kernels and memory usage may not be optimal for the Jetson platform.

We also have cuDNN samples for both TX2 and x86-based environments.
Could you check whether the issue can be reproduced with the official cuDNN samples?

For example, with JetPack 3.1:

$ cp -r /usr/src/cudnn_samples_v6/ .
$ cd cudnn_samples_v6/
$ cd mnistCUDNN/
$ make
$ nvprof -o [filename] ./mnistCUDNN

Thanks.