Have there been any updates on the roadmap for diarization? Riva marketing material has stated for a while that it will be supported, so I have my fingers crossed that it will be out soon. Thought I’d ask here since TitaNet is out and has made it into NeMo.
I’ve been testing the titanet_large model and using Riva’s ASR word timestamps to perform diarization on the ASR output. But since the diarization pipeline isn’t end-to-end trainable, it seems tedious to get it set up and running performantly. I considered dropping the model into the Triton model repository used by Riva, but I’m not sure whether that’s recommended or even feasible.
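For context, this is roughly the flow I’m prototyping: take segments grouped from Riva’s word timestamps, extract titanet_large embeddings, and cluster them. A minimal sketch below; the file names, segment boundaries, and clustering settings are just placeholders for illustration.

```python
# Sketch: assign speaker labels by clustering titanet_large embeddings over
# segments derived from Riva word timestamps. Paths, segments, and the
# fixed n_clusters=2 are illustrative assumptions.
import numpy as np
import soundfile as sf
from sklearn.cluster import AgglomerativeClustering
from nemo.collections.asr.models import EncDecSpeakerLabelModel

speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")

audio, sr = sf.read("call.wav")  # hypothetical mono 16 kHz recording
segments = [(0.0, 2.4), (2.4, 5.1), (5.1, 7.8)]  # (start, end) in seconds from Riva timestamps

embeddings = []
for i, (start, end) in enumerate(segments):
    chunk = audio[int(start * sr):int(end * sr)]
    sf.write(f"/tmp/seg_{i}.wav", chunk, sr)
    emb = speaker_model.get_embedding(f"/tmp/seg_{i}.wav")
    embeddings.append(emb.squeeze().cpu().numpy())

X = np.stack(embeddings)
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize so Euclidean distance ~ cosine

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
for (start, end), label in zip(segments, labels):
    print(f"{start:.1f}-{end:.1f}s -> speaker_{label}")
```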
I’d appreciate any direction or insight on how this could be architected. Right now I’m looking at running a separate Triton instance with multiple model replicas sharing a GPU, with VAD + diarization running via the Python backend to squeeze out some performance (a rough sketch of what I mean is below). Of course it would be great if it were baked into Riva itself, but until then :)
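Something along these lines is what I have in mind for the Python-backend model; the tensor names and the `my_diarization` helper module are purely hypothetical, and the replica count / GPU sharing would be controlled separately in config.pbtxt via `instance_group`.

```python
# model.py for a hypothetical Triton Python-backend diarization model.
# Replicas sharing a GPU would be configured in config.pbtxt, e.g.
# instance_group { count: 4, kind: KIND_GPU }; this file only sketches
# the per-request flow. Tensor names and helpers are assumptions.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the VAD and speaker-embedding models once per model instance.
        from my_diarization import load_models, cluster  # hypothetical helper module
        device = f"cuda:{args['model_instance_device_id']}"
        self.vad, self.embedder = load_models(device=device)
        self.cluster = cluster

    def execute(self, requests):
        responses = []
        for request in requests:
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()
            speech_segments = self.vad(audio)                   # [(start, end), ...]
            embeddings = self.embedder(audio, speech_segments)  # one vector per segment
            labels = self.cluster(embeddings)                   # speaker label per segment
            out = pb_utils.Tensor("SPEAKER_LABELS", np.asarray(labels, dtype=np.int32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```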
On a related note, the 128-token limit on the punctuation model makes it extra difficult to use the timestamps provided with the transcripts for voice embedding, since there is no unique ID tying a word hypothesis to its timestamp. I’m manually appending the cut-off, unpunctuated stretches of speech from the word timestamps to the end of each request’s final transcript whenever the timestamp word list and the final transcript’s word list differ in length.
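In case it helps anyone hitting the same thing, the workaround looks roughly like this; the field names (`alternatives[0].transcript`, `alternatives[0].words`, `word_info.word`) follow Riva’s recognition results but should be treated as assumptions and double-checked against the actual response.

```python
# Sketch of the workaround: if the punctuated final transcript was cut short
# by the punctuation model's ~128-token limit, re-attach the missing words
# from the word-timestamp list so both word lists line up again.
def patch_transcript(result):
    alt = result.alternatives[0]
    transcript_words = alt.transcript.split()
    timestamp_words = [word_info.word for word_info in alt.words]

    # If the punctuated transcript is shorter than the timestamped word list,
    # append the unpunctuated tail so every word can be paired with a timestamp.
    if len(transcript_words) < len(timestamp_words):
        missing = timestamp_words[len(transcript_words):]
        alt.transcript = alt.transcript.rstrip() + " " + " ".join(missing)

    return alt.transcript
```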
Does anyone have any updates on this? It was mentioned in the recent keynote but still isn’t available. What better channel do I have to communicate with the NVIDIA teams, when I, as an active member on the forum and elsewhere, just don’t get any responses or feel heard? If there’s a better way to get updates on this, I’m all ears :) #inception