cuDNN BatchNorm backward result is not correct? Large difference from the CPU result.

We observed a large difference between the batch normalization backward results computed by cuDNN and on the CPU.
Can a cuDNN expert share some information about why the cuDNN and CPU results for the batch normalization backward calculation differ?

  • The related cuDNN call is wrap::cudnnBatchNormalizationBackward() (tensorflow-1.8.0/tensorflow/stream_executor/cuda/cuda_dnn.cc, line 3138)

We are using the following environment:

  • cuda-9.0.176.3
  • cudnn-9.0-linux-x64-v7 (libcudnn.so.7.0.5)
  • NVIDIA TITAN V (Volta), with nvidia-driver-x86_64-390.77.run (on Ubuntu 16.04)
  • tensorflow-gpu-1.8.0

Please see below for the detailed background:

To evaluate the float16 batch normalization BACKWARD calculation with cuDNN (Volta Tensor Cores) and on the CPU:
We first dumped a data set from a real model run (with cuDNN), then fed it into the following four standalone TensorFlow scripts and got four outputs (a minimal sketch of the comparison follows the list):

  1. cuDNN with float16 input, float16 output
  2. cuDNN with float32 input, float32 output then converted to float16
  3. CPU with float16 input, float16 output
  4. CPU with float32 input, float32 output then converted to float16

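For reference, here is a minimal sketch (not our actual scripts) of how cases 1 and 4 can be reproduced with the TF 1.x API. The tensor shape and the random data below are placeholders; in our real experiment the input, the upstream gradient, and the scale/offset were dumped from the model run.

```python
import numpy as np
import tensorflow as tf  # written against the TF 1.x API (we use 1.8.0)

# Placeholder data; in the real experiment these tensors were dumped from the model.
N, H, W, C = 32, 14, 14, 64
x_np  = np.random.randn(N, H, W, C).astype(np.float32)
dy_np = np.random.randn(N, H, W, C).astype(np.float32)
scale_np  = np.random.randn(C).astype(np.float32)
offset_np = np.zeros(C, dtype=np.float32)

def bn_backward(device, dtype):
    """Run fused batch norm forward + backward on `device` and return
    (dx, dscale, doffset) cast to float32 for comparison."""
    g = tf.Graph()
    with g.as_default(), tf.device(device):
        x  = tf.constant(x_np.astype(dtype))
        dy = tf.constant(dy_np.astype(dtype))
        scale  = tf.constant(scale_np)   # scale/offset stay float32
        offset = tf.constant(offset_np)
        y, _, _ = tf.nn.fused_batch_norm(x, scale, offset, is_training=True)
        grads = tf.gradients(y, [x, scale, offset], grad_ys=dy)
    with tf.Session(graph=g) as sess:
        return [v.astype(np.float32) for v in sess.run(grads)]

gpu_fp16 = bn_backward('/gpu:0', np.float16)   # case 1: cuDNN path, float16 in/out
cpu_fp32 = bn_backward('/cpu:0', np.float32)   # case 4: CPU path, float32

# Save dx from both runs for later comparison (filenames are arbitrary).
np.save('dx_gpu_fp16.npy', gpu_fp16[0])
np.save('dx_cpu_fp32.npy', cpu_fp32[0])
```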
We expected the four float16 outputs to be very close, with no big difference, but what we actually found is:

  • outputs 1 and 2 show no difference from the real reference data (which confirms the dumped data is correct)
  • there is no difference between outputs 1 and 2
  • there is no difference between outputs 3 and 4
  • the difference between output 3 (or 4) and output 1 (or 2) is unexpectedly big: the average relative diff is about 67% (a sketch of this metric follows the list)
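The 67% figure comes from an element-wise relative-difference metric along the lines of the snippet below. The filenames are the hypothetical ones from the sketch above, and the guard value in the denominator may differ slightly from our original script.

```python
import numpy as np

# Hypothetical filenames from the sketch above; any pair of the four outputs
# can be compared the same way.
a = np.load('dx_gpu_fp16.npy').astype(np.float32)   # output 1 (or 2)
b = np.load('dx_cpu_fp32.npy').astype(np.float32)   # output 3 (or 4)

# Element-wise relative difference, guarded against division by zero.
rel = np.abs(a - b) / (np.abs(b) + 1e-6)
print('average relative diff: %.2f%%' % (100.0 * rel.mean()))
```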

Our concerns are:

  1. Why is the difference between outputs 3 (or 4) and 1 (or 2) so big?
  2. Does the difference have any side effect on convergence for models that use the CPU float32 calculation?

Could any cuDNN expert share information about the difference between the cuDNN and CPU calculations for the batch normalization backward pass?