Need suggestions on gradient explosion when training NeMo's QuartzNet?

I trained a QuartzNet15x5 model using NeMo on Thai and English alphabets.
I changed the labels, used 8k WAV files, and trained on two A100 GPUs.
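
For context, the setup looks roughly like the sketch below. The file paths, the truncated label list, and the exact Lightning Trainer arguments are placeholders, not my real values (the actual hyperparameters are in the attached YAML):

```python
# Rough sketch of my setup; paths and the label list are placeholders.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

cfg = OmegaConf.load("quartznet_15x5.yaml")  # placeholder config path

# Replace the default English labels with the combined Thai + English set.
labels = [" ", "a", "b", "c", "ก", "ข"]  # truncated; the real list covers both alphabets
cfg.model.labels = labels
cfg.model.train_ds.labels = labels
cfg.model.validation_ds.labels = labels
cfg.model.train_ds.manifest_filepath = "train_manifest.json"     # placeholder
cfg.model.validation_ds.manifest_filepath = "val_manifest.json"  # placeholder

# Two A100s, DDP (Trainer flags may differ by Lightning version).
trainer = pl.Trainer(devices=2, accelerator="gpu", strategy="ddp", max_steps=100000)
model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)
```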

Without any further configuration changes, the model suffers from gradient explosion and fails to converge: after 100k steps, all validation losses are NaN.
hparams.yaml.txt (7.1 KB)
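
One thing I have not enabled yet is gradient clipping. If I understand correctly, it can be set through the Lightning trainer (`gradient_clip_val` is a standard `pl.Trainer` argument; 0.5 below is an arbitrary guess). Would something like this be the right way to address it?

```python
# Untried so far: clip the gradient norm via the Lightning trainer.
trainer = pl.Trainer(
    devices=2,
    accelerator="gpu",
    strategy="ddp",
    max_steps=100000,
    gradient_clip_val=0.5,  # arbitrary value, just for illustration
)
```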

I also tried changing the learning rate and the weight decay.
Since the default setup may be tuned for 8-GPU training, I increased the batch size to 64 and decreased the learning rate to 5e-3.
The gradient explosion persists.
hparams-1.yaml.txt (7.1 KB)
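
In terms of the sketch above, the second run amounted to overrides along these lines (the weight decay value shown is only a placeholder; the actual values are in the attached file):

```python
# Second attempt: larger per-GPU batch, smaller LR, adjusted weight decay.
# Field names follow the standard QuartzNet config layout (Novograd optimizer).
cfg.model.train_ds.batch_size = 64
cfg.model.optim.lr = 5e-3
cfg.model.optim.weight_decay = 1e-3  # placeholder; see attached hparams-1.yaml.txt
```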

Do you have any suggestions?
What is the proper way to train the QuartzNet model on this data?