Production Inference Path for Fine-Tuned Canary-v2 (TensorRT or RIVA Support)

Dear all,

We are Dharma-AI, an AI startup currently developing an automatic speech-to-text transcription product. We are fine-tuning NVIDIA’s Canary-v2 model for Portuguese, with a specific focus on the vocabulary and terminology of the Brazilian legal domain. Fine-tuning is already underway, and we are finalizing a first version of the model so that we can begin inference testing in a production-like environment.

For this stage, we would like to leverage an NVIDIA inference SDK, such as TensorRT or RIVA, aiming to ensure performance, scalability, and alignment with the NVIDIA ecosystem. However, throughout our research and practical experimentation, we have encountered the following technical limitations:

TensorRT: as far as we have been able to verify, there is currently no direct support for running speech-to-text models such as Canary-v2 in the .nemo format. To use it with TensorRT, it would be necessary to convert the model to ONNX; however, so far, we have not found official support for exporting fine-tuned .nemo models to .onnx.

RIVA: we understand that RIVA provides more native support for ASR models, but it requires the model to be in the .riva format. Although there is the nemo2riva tool for converting .nemo to .riva, our tests indicate that this tool does not support the Canary-v2 architecture, making this path infeasible at the moment.
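For anyone trying to reproduce the nemo2riva limitation described above, it can help to confirm which model class a fine-tuned checkpoint actually declares. The following is a minimal sketch, not an official tool. It assumes the .nemo file is a tar archive containing a model_config.yaml with a top-level `target:` key, which is how NeMo checkpoints are typically packaged. The function name `nemo_model_target` is hypothetical.

```python
import tarfile
from typing import Optional


def nemo_model_target(nemo_path: str) -> Optional[str]:
    """Return the top-level `target:` value from model_config.yaml in a .nemo archive.

    Assumption: the checkpoint is a (possibly compressed) tar archive and its
    config stores the model class path under an unindented `target:` key.
    Returns None if no such key is found.
    """
    # "r:*" lets tarfile auto-detect plain or compressed archives.
    with tarfile.open(nemo_path, "r:*") as archive:
        for member in archive.getmembers():
            # Member names may be prefixed, e.g. "./model_config.yaml".
            if not member.name.endswith("model_config.yaml"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for line in handle.read().decode("utf-8").splitlines():
                # Only unindented keys describe the model itself; nested
                # `_target_` entries belong to sub-modules and are skipped.
                if line.startswith("target:"):
                    return line.split(":", 1)[1].strip()
    return None
```

Calling `nemo_model_target("model.nemo")` on a checkpoint (the path here is a placeholder) returns the declared class path as a string, which can then be compared against the architectures a given conversion tool lists as supported.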

Given this scenario, we would appreciate clarification from the community and/or the NVIDIA technical team on the following points:

  • Is there currently (or planned in the near future) an official and supported path to run fine-tuned Canary-v2 models in production using TensorRT or RIVA?

  • If not, what would be the best practice recommended by NVIDIA for deploying this type of model in production?

We thank you in advance for any guidance or technical references that could help us in this deployment workflow.

Thanks for sharing your request with the TensorRT team!

TensorRT will release a seamless model import tool on the TensorRT Incubator repo (target: mid/late April). With this tool, TensorRT Model Connect, you can import Canary-v2 directly into TensorRT with three lines of commands, with no ONNX step required. Please watch for the upcoming release and tech blog for more information.

Thanks for the answer! When the release is published, will you send a message here? Our tuned model is ready to test with it. Right now, we are using NeMo for inference.
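For reference, the interim NeMo inference path mentioned above can be sketched roughly as follows. This is a sketch under stated assumptions, not a recommended production setup: `load_finetuned_model`, `batched`, and `transcribe_files` are hypothetical helper names, and the only NeMo calls used are `ASRModel.restore_from(...)` and the model's `transcribe(...)` method. Any extra keyword arguments (for example language settings) are passed through unchanged and depend on the NeMo version and model in use.

```python
from typing import Any, Iterable, Iterator, List


def load_finetuned_model(nemo_path: str) -> Any:
    """Restore a fine-tuned checkpoint from a .nemo file (requires NeMo)."""
    # Imported lazily so the helpers below work without NeMo installed.
    from nemo.collections.asr.models import ASRModel

    model = ASRModel.restore_from(restore_path=nemo_path)
    model.eval()
    return model


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transcribe_files(
    model: Any,
    audio_paths: Iterable[str],
    batch_size: int = 8,
    **transcribe_kwargs: Any,
) -> List[str]:
    """Transcribe audio files chunk by chunk and return plain strings.

    `model` is any object exposing `transcribe(list_of_paths, batch_size=...)`.
    """
    texts: List[str] = []
    for chunk in batched(list(audio_paths), batch_size):
        results = model.transcribe(chunk, batch_size=batch_size, **transcribe_kwargs)
        # Depending on the NeMo version, results are strings or objects
        # with a `.text` attribute; normalize both to strings.
        texts.extend(getattr(result, "text", result) for result in results)
    return texts
```

A call might look like `transcribe_files(load_finetuned_model("model.nemo"), paths, source_lang="pt", target_lang="pt")`, where the path is a placeholder and the language keyword names are an assumption based on how Canary models are usually invoked. Keeping the model behind a small wrapper like this should make it easier to swap in a different runtime later.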

Yes! I will share it here once we release it, and I look forward to your feedback :)

Hi Gabriel,

Thanks for your patience!

We are developing a new feature called TensorRT Model Connect that directly addresses this.

Instead of forcing you through a fragile PyTorch → ONNX → inference pipeline, TensorRT Model Connect acts as a porting funnel that lets you go directly from PyTorch to C++ in just 1-2 commands. For your specific use case, it includes tailored APIs for speech workloads, allowing you to pass audio clips directly into the C++ runtime and receive transcripts using transcribe_batch(...).

Early Access & Feedback: This feature is currently in an experimental incubator phase and is not yet public. We are sharing the private repo with developers willing to test it on their specific models and partner with us by providing direct, candid feedback.

If you are interested in testing this out for your Canary-v2 deployment, please reply or send me a DM with your email, and I will share the private repo with you.

Thanks a lot!