Hi,
Please use the finetune.yaml below together with the pretrained model (Speech to Text English QuartzNet | NVIDIA NGC).
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained ASR model (Jasper or QuartzNet) on a new dataset.
trainer:
  max_epochs: 3 # This is low for demo purposes
  tlt_checkpoint_interval: 1

# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false
#change_vocabulary: true

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  batch_size: 32
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  batch_size: 32
  shuffle: false

# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001
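The `manifest_filepath: ???` entries must be overridden (as done on the command lines below) with NeMo-style JSON-lines manifests, where each line is a standalone JSON object with `audio_filepath`, `duration`, and `text` keys. As a minimal sketch (the file path and entries here are hypothetical examples, not your actual an4 data):

```python
import json

# Hypothetical example entry; a real manifest is generated from your dataset.
entries = [
    {"audio_filepath": "data/an4_converted/wav/utt_0001.wav",
     "duration": 2.4,
     "text": "hello world"},
]

# Each line of a NeMo-style manifest is one self-contained JSON object.
with open("train_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```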
I ran evaluation before fine-tuning:
#
speech_to_text evaluate -e specs/speech_to_text/evaluate.yaml -k tlt_encode -m speechtotext_english_quartznet.tlt -r evalution_result_quartznet_tlt test_ds.manifest_filepath=data/an4_converted/test_manifest.json
DATALOADER:0 TEST RESULTS
{'test_loss': 1.9822663068771362, 'test_wer': 0.08408796787261963}
Then I ran fine-tuning for 100 epochs:
#
speech_to_text finetune -e specs/speech_to_text/finetune.yaml -k tlt_encode -m speechtotext_english_quartznet.tlt -r result_finetune_again_2 finetuning_ds.manifest_filepath=data/an4_converted/train_manifest.json validation_ds.manifest_filepath=data/an4_converted/test_manifest.json trainer.max_epochs=100
Evaluating the fine-tuned model gives the result below:
#
speech_to_text evaluate -e specs/speech_to_text/evaluate.yaml -k tlt_encode -m result_finetune_again_2/checkpoints/finetuned-model.tlt -r evalution_result_quartznet_tlt_finetune test_ds.manifest_filepath=data/an4_converted/test_manifest.json
DATALOADER:0 TEST RESULTS
{'test_loss': 1.7699419260025024, 'test_wer': 0.05304010212421417}
That means fine-tuning takes effect and gives a better result: test WER drops from about 8.4% to about 5.3%.
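To quantify the gain, the relative WER reduction can be computed directly from the two evaluation outputs above:

```python
# WER values reported by the two evaluation runs above.
baseline_wer = 0.08408796787261963   # pretrained QuartzNet
finetuned_wer = 0.05304010212421417  # after 100 epochs of fine-tuning

# Relative WER reduction achieved by fine-tuning.
relative_reduction = (baseline_wer - finetuned_wer) / baseline_wer
print(f"Relative WER reduction: {relative_reduction:.1%}")  # → Relative WER reduction: 36.9%
```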