Tao speech_to_text evaluate+infer show very weak results

Hi,
Please use the finetune.yaml below and the pretrained model (Speech to Text English QuartzNet | NVIDIA NGC).

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained ASR model (Jasper or QuartzNet) on a new dataset (here, the AN4 English dataset).
 
trainer:
  max_epochs: 3   # This is low for demo purposes
 
tlt_checkpoint_interval: 1
 
# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false
#change_vocabulary: true
 
# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  batch_size: 32
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null
 
# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  batch_size: 32
  shuffle: false
 
# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001
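
For reference, manifest_filepath expects a NeMo-style manifest: a JSON-lines file where every line carries "audio_filepath", "duration", and "text". Below is a minimal Python sketch of how such a manifest can be built; the directory layout and the transcripts mapping are placeholder assumptions on my side, not something taken from the TAO specs.

# Minimal sketch: write a NeMo-style JSON-lines manifest for a folder of 16 kHz WAV files.
# The wav_dir path and the transcripts mapping are hypothetical placeholders.
import json
import wave
from pathlib import Path

wav_dir = Path("data/an4_converted/wavs")    # placeholder audio directory
transcripts = {"example.wav": "yes"}         # placeholder: filename -> transcript

with open("data/an4_converted/train_manifest.json", "w") as manifest:
    for wav_path in sorted(wav_dir.glob("*.wav")):
        with wave.open(str(wav_path)) as wav:
            # duration in seconds = frames / sample rate
            duration = wav.getnframes() / wav.getframerate()
        entry = {
            "audio_filepath": str(wav_path),
            "duration": round(duration, 2),
            "text": transcripts.get(wav_path.name, "").lower(),
        }
        manifest.write(json.dumps(entry) + "\n")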

First, I ran evaluation before fine-tuning to get a baseline.

# speech_to_text evaluate -e specs/speech_to_text/evaluate.yaml -k tlt_encode -m speechtotext_english_quartznet.tlt -r evalution_result_quartznet_tlt test_ds.manifest_filepath=data/an4_converted/test_manifest.json

DATALOADER:0 TEST RESULTS
{'test_loss': 1.9822663068771362, 'test_wer': 0.08408796787261963}
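
For context on what test_wer measures: word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, so 0.084 means roughly 8.4% of the test words are wrong. Here is a rough standalone sketch of the metric (not TAO's exact implementation):

# Rough sketch of word error rate via Levenshtein distance over word tokens.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("go to room five", "go too room five"))  # 0.25: one error over four reference words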

Then I ran fine-tuning for 100 epochs; the trainer.max_epochs=100 override on the command line takes precedence over the max_epochs: 3 in the spec.

# speech_to_text finetune -e specs/speech_to_text/finetune.yaml -k tlt_encode -m speechtotext_english_quartznet.tlt -r result_finetune_again_2 finetuning_ds.manifest_filepath=data/an4_converted/train_manifest.json validation_ds.manifest_filepath=data/an4_converted/test_manifest.json trainer.max_epochs=100

Evaluating the fine-tuned checkpoint gives the result below.

# speech_to_text evaluate -e specs/speech_to_text/evaluate.yaml -k tlt_encode -m result_finetune_again_2/checkpoints/finetuned-model.tlt -r evalution_result_quartznet_tlt_finetune test_ds.manifest_filepath=data/an4_converted/test_manifest.json

DATALOADER:0 TEST RESULTS
{'test_loss': 1.7699419260025024, 'test_wer': 0.05304010212421417}

That means fine-tuning takes effect and gives a better result: the test WER drops from about 8.4% to about 5.3%.
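
If you also want to sanity-check the actual transcriptions, you can run inference with the fine-tuned model on a few test files, mirroring the commands above. Note that the file_paths override key and the WAV path below are assumptions on my side, so please verify them against the downloaded specs/speech_to_text/infer.yaml.

# speech_to_text infer -e specs/speech_to_text/infer.yaml -k tlt_encode -m result_finetune_again_2/checkpoints/finetuned-model.tlt -r infer_result_quartznet_finetune file_paths=[data/an4_converted/wavs/sample.wav]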