Deployments of custom-trained ASR models result in empty transcript results


I am trying to train and deploy a custom ASR model with Riva. I have been able to train and evaluate Citrinet models with NeMo, but I had trouble deploying them and decided to see if I could have better results following the linked tutorial notebooks’ steps closely:

I can get through the steps in the training notebook reasonably well, but once I try to actually deploy a Quartznet 15x5 model that I custom-trained, I find that I get empty results to transcription requests sent to the server. Output from an offline request:

results {
  channel_tag: 1

Sometimes the results include an “audio_processed” field, but the duration shown is implausibly small (this is for a ~45 second file; note the e-41 exponent):

results {
  channel_tag: 1
  audio_processed: 4.5830867574207467e-41
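
To make the "e-41" point concrete, here is a minimal sketch that looks at the raw bits of that value, assuming the field is a 32-bit float. The helper name `float32_bits` is made up for illustration and is not part of Riva or its client library.

```python
import struct


def float32_bits(value: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of `value` as an int."""
    # Pack as a little-endian float32, then reinterpret the same 4 bytes
    # as an unsigned 32-bit integer.
    return struct.unpack("<I", struct.pack("<f", value))[0]


# Value copied from the response above.
reported = 4.5830867574207467e-41
bits = float32_bits(reported)

# Exponent bits (bits 23..30) all zero means a subnormal float32, i.e. the
# stored word is just a small integer rather than a meaningful duration.
exponent_field = (bits >> 23) & 0xFF
print(f"bits = 0x{bits:08X}, exponent field = {exponent_field}")

# For comparison, the bit pattern of a plausible 45-second duration.
print(f"45.0 -> 0x{float32_bits(45.0):08X}")
```

The reported value is a subnormal float32 whose bits are just a small integer, nothing like the encoding of 45.0. That may mean the field was never filled in with a real duration, but that is only a guess on my part.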

For streaming queries, I just get nothing back.

The Nvidia-provided models work fine when I launch them by setting the appropriate parts of the quickstart configuration and running the quickstart scripts. Likewise, I was able to successfully download a .tlt for a Quartznet model from the Nvidia Catalog, export that to a .riva, build that to a .rmir, and use riva-deploy.

Given the export, build, and deploy steps I followed were successful in deploying the pretrained Quartznet model, I imagine the problem is in my training process. To train, I have been pretty much just following the steps and commands outlined in the linked Catalog notebook.

I also tried following the fine-tuning step to tune an Nvidia-provided .tlt with custom data, and the resulting model gave the same issue as the models I trained from scratch.

For context, the custom models I’ve trained in that notebook with the sample data all produce nothing (or just a single character) in the inference step and have very high loss and WER. I imagine that the tutorial parameters and sample data would have been selected so as to produce a model that at least pulls out a word or two, so something seems to be going wrong.
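
For reference, this is the usual definition of WER (word-level edit distance divided by the number of reference words), written as a small standalone sketch. It is not the code NeMo or TLT uses internally, and the function name `word_error_rate` is just illustrative. It shows why a model that outputs nothing scores a WER of 1.0.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# An empty transcript deletes every reference word.
print(word_error_rate("the cat sat", ""))             # 1.0
# A perfect transcript has no errors.
print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
```

A WER near 1.0 alongside empty or single-character output would fit a model that has not learned to emit anything but blanks, though I have not confirmed that is what is happening here.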

Does anyone have any advice? I can provide whatever additional information is needed.

Hardware: AWS g4dn.xlarge instance, with a T4 GPU
Operating System: Ubuntu 20.04 LTS via NVIDIA GPU Cloud image
Riva Version - 1.7
TLT Version (if relevant) - 3.21.08


I’ve run through training with the example scripts provided in the NeMo package and deployed using the build commands from the documentation (not the notebooks), and I see the same issue. The output is a similar object containing only “channel_tag”, sometimes with “audio_processed” as well, but only with a tiny value like the one in the example above.

I’ve tried several different versions, model training processes, environments, and command-line arguments (e.g. with and without --offline), but I still haven’t been able to get results from the deploy process unless the model was one I downloaded directly from NVIDIA. Would really appreciate figuring this one out so we can use the models we’ve trained!