According to this blog post (https://devblogs.nvidia.com/tensor-ops-made-easier-in-cudnn/) and this section of the cuDNN developer guide (https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-rnn-functions),
there is an easy way to use the FP16 Tensor Cores on Xavier's Volta GPU: the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION math type lets cuDNN convert float32 data to float16 internally and run the computation on Tensor Cores.
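For the convolution case, a minimal sketch of what I mean (error handling and descriptor setup omitted; this assumes cuDNN >= 7.2, where CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION was introduced):

```c
#include <cudnn.h>

/* Sketch only: request the FP32->FP16 auto-conversion path on a
 * convolution descriptor so cuDNN may pick Tensor Core kernels. */
void enable_tensor_cores_for_conv(cudnnConvolutionDescriptor_t convDesc)
{
    cudnnSetConvolutionMathType(convDesc,
                                CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
}
```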
I tried the above method, and the convolution part works; the modified cuDNN sample code is linked below.
The profiler shows that when computing the convolution, the kernel changed from an FP32 kernel to a half-precision Tensor Core kernel, which means the automatic conversion worked and the Tensor Cores are utilized.
But when I try this method in RNN mode, it always fails: the profiler shows that no half-precision kernel is invoked during the RNN computation.
The FC (GEMM) kernel between the LSTM cells is still the FP32 one.
My input size, hidden size, and batch size are already multiples of 8, as the document requires.
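For reference, this is how I understand the RNN path is supposed to be enabled: the math type is set on the RNN descriptor via cudnnSetRNNMatrixMathType rather than on a convolution descriptor. A minimal sketch (cuDNN >= 7.2 assumed; error handling and the rest of the RNN setup omitted):

```c
#include <cudnn.h>

/* Sketch only: request the FP32->FP16 auto-conversion path for an RNN.
 * The docs additionally require input size, hidden size, and batch size
 * to all be multiples of 8 for Tensor Core kernels to be selected. */
void enable_tensor_cores_for_rnn(cudnnRNNDescriptor_t rnnDesc)
{
    cudnnSetRNNMatrixMathType(rnnDesc,
                              CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
}
```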
Because cuDNN is like a black box to me, could someone help with this?
The modified cuDNN sample code can be downloaded from this link: https://www.dropbox.com/s/up18uoyq9szwaqx/cudnn_samples_v7_auto_conv_tc.tgz?dl=0
The RNN run commands:
cd RNN; make clean; make
nvprof ./RNN 24 8 512 64 2
Ubuntu 18.04.2 LTS
Linux jetson-0423218010724 4.9.108-tegra #1 SMP PREEMPT Wed Oct 31 15:17:21 PDT 2018 aarch64 aarch64 aarch64 GNU/Linux
cudnn_samples_v7_auto_conv_tc.tgz.zip (6.87 MB)