A6000: Training DeepSpeech giving NaN loss

Training was working on a V100 (CUDA 10.0 and TensorFlow 1.15), but it is not working on an A6000 with multiple GPUs using the Docker image nvcr.io/nvidia/tensorflow:20.10-tf1-py3: the loss becomes NaN at random epochs. There is a similar issue with the A100 on the DeepSpeech Discourse: NVIDIA A100: Loss nan when training on bare metal - DeepSpeech - Mozilla Discourse
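
For reference, a guard like the one below can pinpoint the exact step at which the loss first turns NaN. This is only a minimal sketch: the toy dense model stands in for the actual DeepSpeech graph, and every name in it is illustrative rather than taken from the real training script.

```python
import numpy as np
import tensorflow as tf  # TF 1.15, as shipped in the 20.10-tf1-py3 container

# Toy stand-in for the real acoustic model; only the check_numerics guard matters here.
x = tf.placeholder(tf.float32, [None, 16])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))

# Raise InvalidArgumentError the moment the loss is NaN/Inf instead of
# training on silently until a later epoch.
checked_loss = tf.debugging.check_numerics(loss, "loss became NaN/Inf")
train_op = tf.train.AdamOptimizer(1e-4).minimize(checked_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        xs = np.random.randn(32, 16).astype(np.float32)
        ys = np.random.randn(32, 1).astype(np.float32)
        _, loss_val = sess.run([train_op, checked_loss], feed_dict={x: xs, y: ys})
```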

Hi @hpathak336,
This doesn't look like a cuDNN issue; I recommend raising it on the relevant forum.

Thanks!

Yes, thanks @AakankshaS for the response. The issue turned out to be related to TensorFlow rather than cuDNN; switching the multi-GPU training to Horovod solved it (see the sketch below).
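
For anyone hitting the same problem, this is roughly the Horovod pattern that worked. It is a minimal sketch with a toy model in place of DeepSpeech; the optimizer, learning rate, and random data are placeholders, not the values used in the real run.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each Horovod process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for the DeepSpeech graph.
x = tf.placeholder(tf.float32, [None, 16])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all ranks each step.
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast rank 0's initial weights so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        xs = np.random.randn(32, 16).astype(np.float32)
        ys = np.random.randn(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: ys})
```

Launched with something like `horovodrun -np 4 python train.py` inside the container, one process per GPU.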

Hi @hpathak336,

It seems there were some additional fixes in the latest cuDNN releases (cuDNN > 8.1).
Could you please try the latest NGC container and let us know whether this resolves the NaN issue in your case?
https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/rel_21-10.html#rel_21-10
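
If it helps, one quick way to confirm which cuDNN version a container actually loads is to query the library directly. This is only a suggestion and assumes the image exposes the library as libcudnn.so.8:

```python
import ctypes

# Assumption: the NGC TF1 images ship cuDNN 8.x as libcudnn.so.8.
# cudnnGetVersion() returns an integer such as 8204 for cuDNN 8.2.4.
libcudnn = ctypes.CDLL("libcudnn.so.8")
print("cuDNN version:", libcudnn.cudnnGetVersion())
```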

Thanks