Nemo NMT training

Hi!

I have been trying to train an English to German NMT model following this tutorial: Tutorial — nemo 0.11.0 documentation. Basically, I picked up the dataset and the tokenizer from that tutorial and configured the training run using the newer material (e.g., Machine Translation Models — NVIDIA NeMo 1.4.0 documentation, and NeMo/enc_dec_nmt.py at main · NVIDIA/NeMo · GitHub). The command I use is:

python examples/nlp/machine_translation/enc_dec_nmt.py \
  --config-path=/train-nemo/configs \
  --config-name=aayn_base \
  do_testing=True \
  trainer.gpus=-1 \
  ~trainer.max_epochs \
  +trainer.max_steps=200000 \
  +trainer.val_check_interval=1000 \
  +exp_manager.name=aayn_base_tutorial \
  +exp_manager.exp_dir=/experiments \
  +exp_manager.create_checkpoint_callback=True \
  +exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU \
  +exp_manager.checkpoint_callback_params.mode=max \
  +exp_manager.checkpoint_callback_params.save_top_k=5 \
  model.preproc_out_dir=/data/nmt_tutorial/out \
  model.train_ds.src_file_name=/data/nmt_tutorial/train.en \
  model.train_ds.tgt_file_name=/data/nmt_tutorial/train.de \
  model.train_ds.use_tarred_dataset=True \
  model.validation_ds.src_file_name=/data/nmt_tutorial/eval/newstest2015.en \
  model.validation_ds.tgt_file_name=/data/nmt_tutorial/eval/newstest2015.de \
  model.test_ds.src_file_name=/data/nmt_tutorial/eval/newstest2016.en \
  model.test_ds.tgt_file_name=/data/nmt_tutorial/eval/newstest2016.de \
  model.encoder_tokenizer.tokenizer_model=/data/nmt_tutorial/bpe8k_yttm.model \
  model.decoder_tokenizer.tokenizer_model=/data/nmt_tutorial/bpe8k_yttm.model \
  model.src_language=en \
  model.tgt_language=de

The config file is the 24x6 network used in a lot of your examples. Note that I get the same result/behavior with the smaller 6x6 network. Training seems to work for a while, but after some time everything appears to go haywire, and I start getting outputs like these:


Epoch 2: 4%|▍ | 6800/176468 [32:22<13:27:33, 3.50it/s, loss=nan, v_num=]
Epoch 2: 4%|▍ | 6800/176468 [32:22<13:27:33, 3.50it/s, loss=nan, v_num=]
Epoch 2: 4%|▍ | 6801/176468 [32:22<13:27:33, 3.50it/s, loss=nan, v_num=]
Epoch 2: 4%|▍ | 6801/176468 [32:22<13:27:33, 3.50it/s, loss=nan, v_num=]

Besides the NaN loss, val_sacreBLEU also drops to 0. I was able to partially remedy this by setting trainer.precision=32: the problem then occurs much less frequently, but I am still unable to reproduce the results claimed in your tutorials (e.g., val_sacreBLEU around 40).

My end goal is to train an English to Slovenian translator, where I am running into similar issues to the ones above, which is why I decided to get this tutorial working first. I am training on a DGX A100, if that matters. If anyone has any ideas about what could be wrong, I would be extremely grateful. Thanks!
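For reference, this is the precision workaround as I apply it: trainer.precision is the standard PyTorch Lightning trainer flag, passed as a Hydra override on the same command line as above. (trainer.gradient_clip_val is the Lightning flag for gradient clipping; I mention it only as a possible knob against NaN losses, not something I have verified fixes this.)

```shell
# Same invocation as above, forcing full fp32 precision instead of the
# default mixed precision; remaining overrides unchanged.
python examples/nlp/machine_translation/enc_dec_nmt.py \
  --config-path=/train-nemo/configs \
  --config-name=aayn_base \
  trainer.precision=32 \
  # optional, untested here: clip gradients to curb fp16/NaN blow-ups
  # trainer.gradient_clip_val=1.0 \
  do_testing=True \
  trainer.gpus=-1
  # ... (dataset, tokenizer, and exp_manager overrides as above)
```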