Megatron any-to-en conversion from nemo -> riva -> rmir

Please provide the following information when requesting support.

Hardware - GPU (A100/A30/T4/V100) : A10G
Hardware - CPU : AMD EPYC 7R32
Operating System : Ubuntu 22.04
Riva Version : 2.18.0

I’m using a g5.4xlarge EC2 instance. I am fine-tuning the NeMo NMT Megatron any-to-en and en-to-any models. After fine-tuning, I want to load the model in Riva Quickstart 2.18.0, but I am having problems with the nemo2riva model conversion.

I am using the nemo24.01.framework container for fine-tuning. I installed nemo2riva (both 2.18 and 2.19) and tried converting the NeMo model to Riva. Version 2.18 worked with a bilingual model, but converting the Megatron model gives the following error:

Traceback (most recent call last):
  File "/usr/local/bin/nemo2riva", line 8, in <module>
    sys.exit(nemo2riva())
  File "/usr/local/lib/python3.10/dist-packages/nemo2riva/cli/nemo2riva.py", line 49, in nemo2riva
    Nemo2Riva(args)
  File "/usr/local/lib/python3.10/dist-packages/nemo2riva/convert.py", line 87, in Nemo2Riva
    export_model(
  File "/usr/local/lib/python3.10/dist-packages/nemo2riva/cookbook.py", line 132, in export_model
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nemo2riva/cookbook.py", line 90, in export_model
    _, descriptions = model.export(
  File "/opt/NeMo/nemo/core/classes/exportable.py", line 114, in export
    out, descr, out_example = model._export(
  File "/opt/NeMo/nemo/core/classes/exportable.py", line 187, in _export
    self._prepare_for_export(output=output, input_example=input_example, **my_args)
  File "/opt/NeMo/nemo/core/classes/exportable.py", line 267, in _prepare_for_export
    replace_for_export(self)
  File "/opt/NeMo/nemo/utils/export_utils.py", line 457, in replace_for_export
    replace_modules(model, default_Apex_replacements)
  File "/opt/NeMo/nemo/utils/export_utils.py", line 426, in replace_modules
    swapped = expansions[m_type](m)
  File "/opt/NeMo/nemo/utils/export_utils.py", line 300, in replace_ParallelLinear
    mod.load_state_dict(n_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LinearWithBiasSkip:
        Unexpected key(s) in state_dict: "_extra_state". 
root@02f633c1c4a6:/workspace#
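
The final frame shows `load_state_dict` rejecting an `_extra_state` entry. PyTorch stores whatever a module returns from `get_extra_state()` under a key ending in `_extra_state`, so the source layer appears to carry an entry that the replacement `LinearWithBiasSkip` does not expect. Below is a minimal sketch of filtering such keys out before loading. It assumes the extra state is not needed for export, which I have not verified against NeMo. `drop_extra_state` is a hypothetical helper, not part of NeMo or nemo2riva, and plain Python values stand in for tensors:

```python
from collections import OrderedDict


def drop_extra_state(state_dict):
    """Return a copy of state_dict without PyTorch "_extra_state" entries.

    Hypothetical helper for illustration; not part of NeMo or nemo2riva.
    """
    return OrderedDict(
        (key, value)
        for key, value in state_dict.items()
        # Matches a top-level "_extra_state" and nested "layer._extra_state".
        if key.split(".")[-1] != "_extra_state"
    )


# Plain values stand in for tensors so the sketch runs without torch.
n_state = OrderedDict(weight=[[1.0, 2.0]], bias=[0.0], _extra_state=None)
filtered = drop_extra_state(n_state)
print(list(filtered))
```

With a real module, the filtered dict would be passed to `load_state_dict`. Alternatively, `load_state_dict(..., strict=False)` reports unexpected keys instead of raising. Whether either is safe inside `replace_ParallelLinear` is an open question.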

I tried every version of nemo2riva in the nemo24.01.framework container, but it gives the same error. I then switched to the nemo22.11 container, where nemo2riva (2.18, 2.19) gives the same error. So I installed nemo2riva 2.14.0 and ran the following command, which converted the NeMo model to Riva:

nemo2riva --key tlt_encode --max-dim 1024 --out /workspace/megatron.riva /workspace/megatron/megatronnmt_any_en_500m.nemo

Then I converted it to RMIR using both riva-speech2.14.servicemaker and riva-speech 2.18, with the following command:

riva-build megatron_translation \
  --name nmt_multi_model \
  megatronnmt_custom_any_en_500m.rmir \
  megatronnmt_custom_any_en_500m.riva
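
The two commands above use different file names for the .riva file (`megatron.riva` versus `megatronnmt_custom_any_en_500m.riva`), so it is worth checking that riva-build receives the file nemo2riva actually wrote. A minimal sketch of deriving both command lines from a single .nemo path follows. `conversion_commands` is a hypothetical helper, the output directory is an assumption, and only the flags already shown in this post are used:

```python
import shlex
from pathlib import PurePosixPath


def conversion_commands(nemo_path, out_dir, name="nmt_multi_model"):
    """Build the nemo2riva and riva-build argument lists from one .nemo path.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    stem = PurePosixPath(nemo_path).stem
    riva_path = str(PurePosixPath(out_dir) / f"{stem}.riva")
    rmir_path = str(PurePosixPath(out_dir) / f"{stem}.rmir")
    nemo2riva = [
        "nemo2riva", "--key", "tlt_encode", "--max-dim", "1024",
        "--out", riva_path, nemo_path,
    ]
    # riva-build takes the output .rmir first, then the input .riva.
    riva_build = [
        "riva-build", "megatron_translation", "--name", name,
        rmir_path, riva_path,
    ]
    return nemo2riva, riva_build


convert_cmd, build_cmd = conversion_commands(
    "/workspace/megatron/megatronnmt_any_en_500m.nemo", "/workspace"
)
print(shlex.join(convert_cmd))
print(shlex.join(build_cmd))
```

Deriving both paths from the same stem keeps the .riva written by the first step and the .riva read by the second step identical.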

I tried both of these models in Riva Quickstart. First I ran riva_init.sh, which converted them into model directories, but riva_start.sh does not load them on the Riva server.

These are all the steps given in the NVIDIA tutorials, but they don’t work. If anyone can help, please let me know.

What error do you get while running riva_start.sh?