Production Inference Path for Fine-Tuned Canary-v2 (TensorRT or Riva Support)

Dear all,

We are Dharma-AI, an AI startup currently developing an automatic speech-to-text transcription product. As part of it, we are fine-tuning NVIDIA’s Canary-v2 model for Portuguese, with a specific focus on the vocabulary and terminology of the Brazilian legal domain. Fine-tuning is already underway, and we are finalizing a first version of the model to begin our inference testing pipeline in a production-like environment.

For this stage, we would like to leverage an NVIDIA inference SDK, such as TensorRT or Riva, to ensure performance, scalability, and alignment with the NVIDIA ecosystem. However, during our research and practical experimentation we have encountered the following technical limitations:

TensorRT: as far as we have been able to verify, there is currently no direct support for running speech-to-text models such as Canary-v2 from the .nemo format. Using TensorRT would first require converting the model to ONNX; however, so far we have not found official support for exporting fine-tuned Canary .nemo checkpoints to .onnx.
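For context, the export path we attempted looks roughly like the following. This is a minimal sketch, not a working recipe: the checkpoint file names are placeholders, and we loaded the checkpoint with `EncDecMultiTaskModel` (the NeMo class we believe backs Canary), relying on NeMo's generic `export()` method from its Exportable mixin.

```python
def export_canary_to_onnx(nemo_path: str, onnx_path: str) -> None:
    """Restore a fine-tuned Canary checkpoint and attempt NeMo's generic ONNX export.

    Requires nemo_toolkit to be installed; file names are placeholders.
    """
    from nemo.collections.asr.models import EncDecMultiTaskModel

    # Restore the fine-tuned checkpoint from disk
    model = EncDecMultiTaskModel.restore_from(nemo_path)
    model.eval()

    # NeMo's Exportable mixin provides .export(); as far as we can tell,
    # this step is not officially supported for the Canary-v2 architecture.
    model.export(onnx_path)
```

In our environment, a call such as `export_canary_to_onnx("canary_v2_finetuned.nemo", "canary_v2_finetuned.onnx")` is where the lack of official export support surfaces.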

Riva: we understand that Riva provides more native support for ASR models, but it requires the model to be in the .riva format. Although the nemo2riva tool exists for converting .nemo to .riva, our tests indicate that it does not support the Canary-v2 architecture, making this path infeasible at the moment.
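For reference, the conversion path we tested followed the standard nemo2riva invocation (file names below are placeholders for our fine-tuned checkpoint):

```shell
# Install the NeMo-to-Riva conversion tool
pip install nemo2riva

# Attempt to convert the fine-tuned checkpoint to .riva;
# in our tests this step rejects the Canary-v2 architecture.
nemo2riva --out canary_v2_finetuned.riva canary_v2_finetuned.nemo
```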

Given this scenario, we would appreciate clarification from the community and/or the NVIDIA technical team on the following points:

  • Is there currently (or planned for the near future) an official, supported path to run fine-tuned Canary-v2 models in production using TensorRT or Riva?

  • If not, what would be the best practice recommended by NVIDIA for deploying this type of model in production?

We thank you in advance for any guidance or technical references that could help us with this deployment workflow.