According to this blog post (https://devblogs.nvidia.com/tensor-ops-made-easier-in-cudnn/) and this section of the cuDNN developer guide (https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-rnn-functions),
there is an easy way to use the FP16 Tensor Cores on Xavier's Volta GPU: the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION math type lets cuDNN convert float32 data to float16 internally and run the computation on Tensor Cores.
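For the convolution case, a minimal sketch of what I mean (error handling and descriptor setup omitted; this assumes cuDNN >= 7.2, where CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION was introduced):

```c
#include <cudnn.h>

/* Sketch only: request the FP32->FP16 auto-conversion path on a
 * convolution descriptor so cuDNN may pick Tensor Core kernels. */
void enable_tensor_cores_for_conv(cudnnConvolutionDescriptor_t convDesc)
{
    cudnnSetConvolutionMathType(convDesc,
                                CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
}
```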
I tried the above method, and the convolution part works; the modified cuDNN sample code is linked below.
The profiler shows that when computing the convolution, the kernel changed from an FP32 kernel to a half-precision Tensor Core kernel, which means the automatic conversion worked and the Tensor Cores are utilized.
But when I try this method in RNN mode, it always fails: the profiler shows that no half-precision kernel is invoked during the RNN computation.
The FC (GEMM) kernel between the LSTM cells is still the FP32 one.
My input size, hidden size, and batch size are already multiples of 8, as the document requires.
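For reference, this is how I understand the RNN path is supposed to be enabled: the math type is set on the RNN descriptor via cudnnSetRNNMatrixMathType rather than on a convolution descriptor. A minimal sketch (cuDNN >= 7.2 assumed; error handling and the rest of the RNN setup omitted):

```c
#include <cudnn.h>

/* Sketch only: request the FP32->FP16 auto-conversion path for an RNN.
 * The docs additionally require input size, hidden size, and batch size
 * to all be multiples of 8 for Tensor Core kernels to be selected. */
void enable_tensor_cores_for_rnn(cudnnRNNDescriptor_t rnnDesc)
{
    cudnnSetRNNMatrixMathType(rnnDesc,
                              CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
}
```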
Because cuDNN is like a black box to me, could someone help with this?
The modified cuDNN sample code can be downloaded from this link: https://www.dropbox.com/s/up18uoyq9szwaqx/cudnn_samples_v7_auto_conv_tc.tgz?dl=0
The RNN run commands:
cd RNN; make clean; make
nvprof ./RNN 24 8 512 64 2
Ubuntu 18.04.2 LTS
Linux jetson-0423218010724 4.9.108-tegra #1 SMP PREEMPT Wed Oct 31 15:17:21 PDT 2018 aarch64 aarch64 aarch64 GNU/Linux
cudnn_samples_v7_auto_conv_tc.tgz.zip (6.87 MB)