Hi Guys,
There is a cuDNN performance issue on the TX2; here is the background:
My application is a computer-vision app built on TensorFlow, and I am porting it to the TX2. Profiling shows roughly 75% of the time is spent on convolutions, and from the kernel names I can see that tf.conv goes through cuDNN. About 50% of the total time is spent in a single kernel named:
maxwell_scudnn_128x128_stridedB_splitK_small_nn
Comparing the kernel in the nvvp profiling reports, I found GPU utilization is only 60% on the TX2, while the same kernel achieves 95% on a GTX 1080 Ti.
Looking at the report details, the root cause is that the number of active blocks per SM is 2 on the 1080 Ti but only 1 on the TX2, and the limiting factor on the TX2 is shared memory.
Quoting the report:
The kernel uses 34 KiB of shared memory for each block. This shared memory usage is likely preventing the kernel from fully
utilizing the GPU. Device “NVIDIA Tegra X2” is configured to have 64 KiB of shared memory for each SM. Because the kernel
uses 34 KiB of shared memory for each block each SM is limited to simultaneously executing 1 block (8 warps). Chart “Varying
Shared Memory Usage” below shows how changing shared memory usage will change the number of blocks that can execute on
each SM.
Optimization: Reduce shared memory usage to increase the number of blocks that can execute on each SM. You can also increase
the number of blocks that can execute on each SM by increasing the amount of shared memory available to your kernel. You do
this by setting the preferred cache configuration to “prefer shared”.
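For reference, my understanding of the occupancy math is: 64 KiB per SM / 34 KiB per block only fits 1 block, so anything at or below 32 KiB per block would fit 2. Below is a minimal, hypothetical sketch (the kernel, block size, and shared memory size are placeholders standing in for the cuDNN kernel, which I obviously cannot recompile) that uses the occupancy API to confirm this and shows what the "prefer shared" hint looks like for a kernel one controls:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the cuDNN kernel; it only exists so the
// occupancy API has something to query. Its dynamic shared memory is sized in
// the query below to mimic the 34 KiB per block reported by nvvp.
__global__ void placeholder_kernel()
{
    extern __shared__ float smem[];
    smem[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    const int threadsPerBlock = 256;        // 8 warps per block, as in the report
    const size_t smemPerBlock = 34 * 1024;  // 34 KiB of shared memory per block

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime how many blocks of this shape fit on one SM.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, placeholder_kernel, threadsPerBlock, smemPerBlock);

    printf("%s: %zu KiB shared per SM -> %d block(s) per SM at %zu KiB/block\n",
           prop.name, prop.sharedMemPerMultiprocessor / 1024,
           blocksPerSM, smemPerBlock / 1024);

    // The profiler's "prefer shared" suggestion is just a hint to the driver,
    // and it only applies to kernels you compile yourself.
    cudaFuncSetCacheConfig(placeholder_kernel, cudaFuncCachePreferShared);
    return 0;
}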
On the 1080 Ti, by contrast, the limiting factor is register usage:
GPU Utilization Is Limited By Register Usage
The kernel uses 128 registers for each thread (32768 registers for each block). This register usage is likely preventing the kernel
from fully utilizing the GPU. Device “GeForce GTX 1080 Ti” provides up to 65536 registers for each block. Because the kernel
uses 32768 registers for each block each SM is limited to simultaneously executing 2 blocks (16 warps). Chart “Varying Register
Count” below shows how changing register usage will change the number of blocks that can execute on each SM.
Optimization: Use the -maxrregcount flag or the launch_bounds qualifier to decrease the number of registers used by each
thread. This will increase the number of blocks that can execute on each SM. On devices with Compute Capability 5.2 turning
global cache off can increase the occupancy limited by register usage.
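Again just for reference (this only applies to kernels one compiles oneself, and the names below are hypothetical, not the actual cuDNN kernel), this is what the register-capping suggestion looks like:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) asks the compiler to
// keep register usage low enough that at least 2 blocks of 256 threads can be
// resident on each SM.
__global__ void __launch_bounds__(256, 2)
example_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder work
}

// Or cap registers globally at compile time, e.g.:
//   nvcc -arch=sm_62 -maxrregcount=64 example.cu -o example

But since the kernel in question ships precompiled inside cuDNN, only the cuDNN team can apply either change.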
So, could someone from the cuDNN team take a look at this? Since 64 KiB per SM only fits one 34 KiB block, reducing the kernel's shared memory usage by just 2 KiB (to 32 KiB per block) would allow 2 blocks per SM and could roughly double this kernel's throughput on the TX2.