Hi There,
According to this blog (https://devblogs.nvidia.com/tensor-ops-made-easier-in-cudnn/) and this section of the cuDNN developer guide (https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-rnn-functions),
there is an easy way to use FP16 Tensor Cores on the Xavier's Volta GPU: the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION flag lets cuDNN convert float32 data to float16 internally and run the computation on Tensor Cores.
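For the convolution case, this is roughly what I changed in the sample (a minimal sketch, with error checking omitted; the descriptor setup is assumed to exist elsewhere and the tensors stay float32):

```cpp
#include <cudnn.h>

// Sketch: allow cuDNN (7.2+) to down-convert FP32 data to FP16
// internally and run the convolution on Tensor Cores. The math type
// is set on the convolution descriptor before calling
// cudnnConvolutionForward.
void enableTensorCoreConv(cudnnConvolutionDescriptor_t convDesc) {
    cudnnSetConvolutionMathType(convDesc,
                                CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
}
```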
I tried this method, and the convolution part works. The modified cuDNN samples code is linked below.
The profiler shows that for the convolution, the kernel changed from
volta_scudnn_128x64_relu_interior_nn_v1
to
volta_s884cudnn_fp16_256x64_ldg8_relu_exp_interior_nhwc2nchw_tn_v1
volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_interior_nhwc2nchw_tn_v1
which means the automatic conversion worked and Tensor Cores are being used.
But when I try the same method on the RNN path, it always fails: the profiler shows no half-precision kernels invoked during the RNN computation.
The GEMM kernel between the LSTM cells is still:
"volta_sgemm_128x64_nn"
My input size, hidden size, and batch size are already multiples of 8, as the documentation requires.
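For reference, this is how I would expect the RNN side to be enabled, based on the developer guide (a sketch under my assumptions, not verified to work; error checking omitted, and the RNN descriptor setup is assumed to happen elsewhere):

```cpp
#include <cudnn.h>

// Sketch: for RNNs, the math type is set on the RNN descriptor with
// cudnnSetRNNMatrixMathType (cuDNN 7.1+), not on a convolution
// descriptor. With ALLOW_CONVERSION, cuDNN should be free to convert
// the FP32 weights/activations to FP16 and use Tensor Core GEMMs.
void enableTensorCoreRNN(cudnnRNNDescriptor_t rnnDesc) {
    cudnnSetRNNMatrixMathType(rnnDesc,
                              CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
}
```

This is what the modified sample does, yet the profiler still shows only volta_sgemm kernels.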
Since cuDNN is a black box to me, could someone help with this?
The modified cuDNN sample code can be downloaded from this link: Dropbox - File Deleted
The RNN run commands:
cd RNN
make clean
make
nvprof ./RNN 24 8 512 64 2
Machine: Xavier
Ubuntu 18.04.2 LTS
Linux jetson-0423218010724 4.9.108-tegra #1 SMP PREEMPT Wed Oct 31 15:17:21 PDT 2018 aarch64 aarch64 aarch64 GNU/Linux
cudnn_samples_v7_auto_conv_tc.tgz.zip (6.87 MB)