It was working on a V100 (CUDA 10.0 and TensorFlow 1.15), but it is not working on an A6000 with multiple GPUs using the Docker image nvcr.io/nvidia/tensorflow:20.10-tf1-py3: the training gives NaN loss at random epochs. There is a similar issue reported for the A100 on the DeepSpeech Discourse: "NVIDIA A100: Loss nan when training on bare metal - DeepSpeech - Mozilla Discourse".
Hi @hpathak336 ,
This doesn’t look like a cuDNN issue; I recommend raising it on the appropriate forum.
Yes, thanks @AakankshaS for the response. This issue was related to TensorFlow; using Horovod solved it. It is not a cuDNN issue.
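For anyone else chasing NaN losses that appear at random epochs, it helps to fail fast instead of letting NaNs propagate. Below is a minimal, framework-agnostic sketch of a training-loop guard that stops at the first non-finite loss so the offending step or batch can be inspected; the function and variable names are illustrative, not from this thread or any specific library:

```python
import math

def run_training(loss_fn, steps):
    """Toy training loop that halts as soon as the loss goes NaN or inf.

    loss_fn: callable returning a float loss for a given step (illustrative).
    Returns (completed_steps, last_finite_loss).
    """
    last_loss = None
    for step in range(steps):
        loss = loss_fn(step)
        if not math.isfinite(loss):
            # Stop immediately so the step/batch that diverged can be
            # examined, rather than training on for many more epochs.
            print(f"non-finite loss at step {step}: {loss}")
            return step, last_loss
        last_loss = loss
    return steps, last_loss

# Simulate a run that diverges at step 3.
losses = [2.0, 1.5, 1.2, float("nan"), 0.9]
steps_done, last = run_training(lambda s: losses[s], len(losses))
```

In a real TF1/Horovod job the same idea applies: check the fetched loss value each step and abort (or dump the batch) the moment it is non-finite.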