cuDNN convolution performance on TX2: 12.5% GPU occupancy

Hi Guys,

I'm seeing a cuDNN performance issue on TX2. Here is the background:

My application is a computer-vision app built on TensorFlow. After porting it to TX2 I did some profiling and found that about 75% of the time is spent in convolutions. From the kernel names I can see that tf.conv uses cuDNN, and about 50% of the total time goes to a single kernel named:

maxwell_scudnn_128x128_stridedB_splitK_small_nn

After comparing the kernel's performance in nvvp's profiling reports, I found the GPU utilization is only about 60% on TX2, while the same kernel reaches about 95% on a GTX 1080 Ti.

Looking into the details of the profile, the root cause is that the number of active blocks per SM is 2 on the 1080 Ti but only 1 on TX2, and the limiting factor is shared memory.

Quoting the report:

The kernel uses 34 KiB of shared memory for each block. This shared memory usage is likely preventing the kernel from fully
utilizing the GPU. Device “NVIDIA Tegra X2” is configured to have 64 KiB of shared memory for each SM. Because the kernel
uses 34 KiB of shared memory for each block each SM is limited to simultaneously executing 1 block (8 warps). Chart “Varying
Shared Memory Usage” below shows how changing shared memory usage will change the number of blocks that can execute on
each SM.
Optimization: Reduce shared memory usage to increase the number of blocks that can execute on each SM. You can also increase
the number of blocks that can execute on each SM by increasing the amount of shared memory available to your kernel. You do
this by setting the preferred cache configuration to “prefer shared”.
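
Just to make that arithmetic concrete (this is my own illustration, not from the report): a minimal CUDA sketch that asks the occupancy API how many blocks fit on one SM for a given dynamic shared-memory footprint. The 256-thread block (8 warps) and the 34 KiB figure come from the report above; the kernel itself is only a placeholder, not the cuDNN kernel.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the cuDNN convolution kernel.
// It exists only so the occupancy API has something to inspect.
__global__ void dummyConvKernel(float *out)
{
    extern __shared__ char smem[];
    smem[threadIdx.x] = (char)threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = smem[threadIdx.x];
}

int main()
{
    const int    blockSize    = 256;        // 8 warps per block, as in the report
    const size_t dynSmemBytes = 34 * 1024;  // 34 KiB per block, as in the report

    int maxBlocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSm, dummyConvKernel, blockSize, dynSmemBytes);

    // With 64 KiB of shared memory per SM (TX2), this should print 1;
    // a footprint of 32 KiB or less would allow 2 resident blocks.
    printf("Resident blocks per SM with %zu KiB shared memory: %d\n",
           dynSmemBytes / 1024, maxBlocksPerSm);
    return 0;
}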

But on the 1080 Ti, the limiting factor is register usage:

GPU Utilization Is Limited By Register Usage
The kernel uses 128 registers for each thread (32768 registers for each block). This register usage is likely preventing the kernel
from fully utilizing the GPU. Device “GeForce GTX 1080 Ti” provides up to 65536 registers for each block. Because the kernel
uses 32768 registers for each block each SM is limited to simultaneously executing 2 blocks (16 warps). Chart “Varying Register
Count” below shows how changing register usage will change the number of blocks that can execute on each SM.
Optimization: Use the -maxrregcount flag or the launch_bounds qualifier to decrease the number of registers used by each
thread. This will increase the number of blocks that can execute on each SM. On devices with Compute Capability 5.2 turning
global cache off can increase the occupancy limited by register usage.
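
For reference, this is what the two knobs the profiler suggests look like on a kernel you compile yourself; they cannot be applied to cuDNN's prebuilt kernels from application code. The 256 threads and the 2-blocks-per-SM target mirror the numbers above, and the kernel body is a placeholder.

// 1) __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) asks the
//    compiler to keep register usage low enough that at least 2 blocks of 256
//    threads can be resident on each SM.
__global__ void __launch_bounds__(256, 2)
myConvLikeKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;  // placeholder work
}

// 2) Alternatively, cap registers for the whole compilation unit at build time:
//    nvcc -maxrregcount=64 -o app app.cu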

So, could someone from the cuDNN team look into this? Since the SM has 64 KiB of shared memory, reducing the kernel's shared memory usage by just 2 KiB (34 KiB down to 32 KiB) would allow two resident blocks per SM instead of one, which could roughly double this kernel's performance on TX2.

Hi,

The performance issue may come from TensorFlow's implementation.
It's recommended to use our TensorRT engine instead of TensorFlow.

We optimize the inference implementation for the GPU architecture, so it can reach better performance.
Here is a tutorial on running a TensorFlow model with TensorRT:

Thanks.

Thanks for the reply.

Actually I can't use TRT, because my application is a bit more complicated than classification and requires ops like 3D convolution, which TRT does not support yet.
TensorFlow is just calling the cuDNN library here, so maybe someone can look at the cuDNN code and see whether reducing this kernel's shared memory usage would achieve better performance on TX2?
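
In the meantime, here is roughly how I plan to check this outside TensorFlow: a standalone cuDNN 6 sketch that enumerates and times the forward-convolution algorithms available for one layer shape, so I can see whether an algorithm that avoids the splitK kernel is competitive on TX2. All the tensor sizes below are placeholders, not the real shapes from my network.

#include <cstdio>
#include <cudnn.h>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudnnStatus_t s = (call);                                        \
        if (s != CUDNN_STATUS_SUCCESS) {                                 \
            printf("cuDNN error %s at line %d\n",                        \
                   cudnnGetErrorString(s), __LINE__);                    \
            return 1;                                                    \
        }                                                                \
    } while (0)

int main()
{
    cudnnHandle_t handle;
    CHECK(cudnnCreate(&handle));

    // Placeholder layer: NCHW float, batch 1, 64 -> 64 channels, 56x56 input, 3x3 filter.
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    CHECK(cudnnCreateTensorDescriptor(&xDesc));
    CHECK(cudnnCreateTensorDescriptor(&yDesc));
    CHECK(cudnnCreateFilterDescriptor(&wDesc));
    CHECK(cudnnCreateConvolutionDescriptor(&convDesc));

    CHECK(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                     1, 64, 56, 56));
    CHECK(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                     64, 64, 3, 3));
    CHECK(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                          CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

    int n, c, h, w;
    CHECK(cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w));
    CHECK(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w));

    // Benchmark every forward algorithm cuDNN knows for this shape.
    cudnnConvolutionFwdAlgoPerf_t perf[8];
    int returned = 0;
    CHECK(cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                               8, &returned, perf));
    for (int i = 0; i < returned; ++i)
        printf("algo %d: status %d, %.3f ms, workspace %zu bytes\n",
               (int)perf[i].algo, (int)perf[i].status, perf[i].time, perf[i].memory);

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}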

Hi,

The TensorFlow kernels are implemented with desktop GPUs in mind.
Their selection of kernels and memory usage may not be optimal for the Jetson platform.

We also have cuDNN samples for both TX2 and x86-based environments.
Could you check whether the issue can be reproduced with the official cuDNN samples?

For example, with JetPack 3.1:

$ cp -r /usr/src/cudnn_samples_v6/ .
$ cd cudnn_samples_v6/
$ cd mnistCUDNN/
$ make
$ nvprof -o [filename] ./mnistCUDNN

Thanks.