Is there a Tensor Core kernel for 3D convolution?

I have tested 2D and 3D convolution using the cuDNN library through its C++ API, with the goal of getting Tensor Core acceleration.

The environment is as follows:
Windows 10
CUDA 10.0
cuDNN 7.6.5
Visual Studio 2017
RTX 2080 Ti

It seems that 3D convolution has no fp16-optimized Tensor Core kernel and gets no acceleration. I used the Nsight Systems profiling tool to find out which kernel function runs in each test case.

I tested the following configurations. Each line has the form:

[tensor core flag, data type, format, number of iterations, batch_size, in_channels, out_channels, image height, image width] --> [kernel used, time (sec)]

The 3D tests list one additional spatial dimension.

2D Convolution test (3x3 conv)

[CUDNN_DEFAULT_MATH, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 4000, 8, 64, 64, 128, 128] --> [volta_scudnn_128x64_relu_small_nn_v1, 3.1 sec]
[CUDNN_DEFAULT_MATH, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, 4000, 8, 64, 64, 128, 128] --> [volta_hcudnn_128x128_relu_small_nn_v1, 3.1 sec]
[CUDNN_TENSOR_OP_MATH, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, 4000, 8, 64, 64, 128, 128] --> [turing_h1688cudnn_256x64_sliced1x2_ldg8_relu_exp_small_nhwc_tn_v1, 1.3 sec]

3D Convolution test (3x3x3 conv)

[CUDNN_DEFAULT_MATH, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 100, 1, 64, 64, 128, 128, 128] --> [volta_scudnn_128x64_stridedB_splitK_small_nn_v1, 3.8 sec]
[CUDNN_DEFAULT_MATH, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, 100, 1, 64, 64, 128, 128, 128] --> [volta_hcudnn_128x128_stridedB_splitK_small_nn_v1, 3.75 sec]
[CUDNN_TENSOR_OP_MATH, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, 100, 1, 64, 64, 128, 128, 128] --> [volta_hcudnn_128x128_stridedB_splitK_small_nn_v1, 3.8 sec]

These results show that 2D convolution uses the optimized kernel ‘turing_h1688cudnn_256x64_sliced1x2_ldg8_relu_exp_small_nhwc_tn_v1’ and is accelerated when the CUDNN_TENSOR_OP_MATH flag and the fp16 data type are used.
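To put numbers on the timings listed above, here is a minimal C++ sketch that computes the speedup between two runs and an effective throughput. The function names are hypothetical, and the FLOP count assumes a direct convolution with stride 1 and "same" padding (output size equal to input size), which the post does not state. The kernels cuDNN actually picks may use other algorithms, so the throughput is only an effective figure.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs for one direct convolution pass.
// Assumption: stride 1 and "same" padding, so output_points is the number of
// output pixels (2D) or voxels (3D) and equals the input spatial size.
double conv_flops(std::int64_t batch, std::int64_t in_ch, std::int64_t out_ch,
                  std::int64_t output_points, std::int64_t kernel_taps) {
    assert(batch > 0 && in_ch > 0 && out_ch > 0);
    assert(output_points > 0 && kernel_taps > 0);
    // 2 FLOPs (one multiply, one add) per multiply-accumulate.
    return 2.0 * static_cast<double>(batch) * in_ch * out_ch *
           output_points * kernel_taps;
}

// Effective throughput in TFLOP/s for `iterations` passes taking `seconds`.
double effective_tflops(double flops_per_iter, int iterations, double seconds) {
    assert(iterations > 0 && seconds > 0.0);
    return flops_per_iter * iterations / seconds / 1e12;
}

// Speedup of a run relative to a baseline run of the same workload.
double speedup(double baseline_seconds, double seconds) {
    assert(baseline_seconds > 0.0 && seconds > 0.0);
    return baseline_seconds / seconds;
}
```

With the measured times, `speedup(3.1, 1.3)` is roughly 2.4 for the 2D case, while `speedup(3.75, 3.8)` is just under 1 for the 3D case, i.e. no gain from CUDNN_TENSOR_OP_MATH.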

However, 3D convolution does not use an optimized kernel; it falls back to the non-Tensor-Core kernel ‘volta_hcudnn_128x128_stridedB_splitK_small_nn_v1’.
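When going through profiler output, a small helper can flag which kernel names look like Tensor Core kernels. The sketch below is a heuristic only: it assumes that the tags "h884"/"s884" (Volta) and "h1688"/"s1688" (Turing) in a cuDNN kernel name indicate Tensor Core use. That is an observed naming pattern, not a documented guarantee, and the function name is hypothetical.

```cpp
#include <string>

// Heuristic, not an official API: returns true if the kernel name carries one
// of the instruction-shape tags commonly seen on Tensor Core kernels.
// Assumption: "884" tags appear on Volta kernels, "1688" tags on Turing ones.
bool looks_like_tensor_core_kernel(const std::string& kernel_name) {
    const char* tags[] = {"h884", "s884", "h1688", "s1688"};
    for (const char* tag : tags) {
        if (kernel_name.find(tag) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

Applied to the kernels in this post, it returns true for the turing_h1688cudnn kernel and false for the volta_scudnn and volta_hcudnn kernels.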

I would like to know whether an optimized Tensor Core kernel for 3D convolution exists.

If it does, what is its name?

Hi,

We have some 3D Tensor Core support in 7.6.5 for Volta.
Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/sdk/cudnn-best-practices/index.html#rec-settings-3d-conv

Thanks

I have tested 3D convolution on a V100 and it is accelerated, but the same code on a T4 shows no acceleration.
Is there any difference between using Tensor Cores on T4 and V100?

The environment is as follows:
CentOS 6.6
CUDA 10.0
cuDNN 7.6.5
V100/T4

Convolution info:
kernel(3,3,3)
pad(1,1,1)
stride(1,1,1)
dilate(1,1,1)
input shape(32, 32, 32)
The input, output, and filter dtype is fp16; the batch size and channel counts are all multiples of 8.
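As a sanity check on these parameters, the sketch below computes the output extent along one spatial dimension with the standard convolution size formula and checks the multiple-of-8 condition mentioned above. The function names are hypothetical and the code does not call cuDNN.

```cpp
#include <cassert>

// Output extent along one spatial dimension:
//   out = (in + 2*pad - dilation*(kernel - 1) - 1) / stride + 1
int conv_output_dim(int in, int pad, int kernel, int stride, int dilation) {
    assert(in > 0 && kernel > 0 && stride > 0 && dilation > 0 && pad >= 0);
    return (in + 2 * pad - dilation * (kernel - 1) - 1) / stride + 1;
}

// The alignment condition stated in the post for batch size and channels.
bool is_multiple_of_8(int value) {
    return value > 0 && value % 8 == 0;
}
```

For the configuration above, `conv_output_dim(32, 1, 3, 1, 1)` returns 32, so each spatial dimension keeps its size.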

Hi,

As mentioned earlier, we have some 3D Tensor Core support in 7.6.5, but only for Volta. That is why you are getting acceleration on V100 and not on T4.
T4 - NVIDIA Turing architecture
V100 - NVIDIA Volta architecture
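
A program can tell the two cards apart by compute capability: V100 reports 7.0 and T4 reports 7.5. The sketch below is a minimal illustration with hypothetical function names; it takes the major and minor numbers as plain integers, which a real program would typically read from the `major` and `minor` fields filled in by `cudaGetDeviceProperties`.

```cpp
#include <string>

// Maps a compute capability to an architecture name, covering only the two
// GPUs discussed in this thread (7.0 = V100, 7.5 = T4).
std::string architecture_name(int major, int minor) {
    if (major == 7 && minor == 0) return "Volta";
    if (major == 7 && minor == 5) return "Turing";
    return "other";
}

// Encodes the statement in this reply: with cuDNN 7.6.5, the 3D Tensor Core
// support is available on Volta only.
bool has_3d_tensor_core_support_in_cudnn_765(int major, int minor) {
    return architecture_name(major, minor) == "Volta";
}
```

With these inputs, the check is true for (7, 0) and false for (7, 5), matching the behaviour reported on V100 and T4.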

Thanks