STT: getting word metadata (start time/duration)

Hello,

I am trying NVIDIA NeMo for the first time.

Basically, what I want to achieve is to transcribe a WAV file to text. I have that working, but I am also interested in getting metadata.

For instance, at which second does each word start?

This feature is implemented in DeepSpeech and Vosk. Do we have something similar in NVIDIA NeMo?

Maybe I missed something.

Thanks !

NeMo is a framework for building applications that could do what you describe; it is not a ready-to-run application.
You can find more details about NeMo, along with tutorials on how to build applications and use some of the pre-trained models, on our developer site: https://developer.nvidia.com/nvidia-nemo
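
As a rough illustration of the kind of logic such an application would need, here is a minimal sketch in plain Python. It does not call NeMo at all. It assumes you already have one character label per output frame from a CTC-style acoustic model (greedy decoding) and that you know how many seconds each frame covers. The names `word_timings` and `WordTiming`, the `_` blank symbol, and the 0.04 s frame duration are all hypothetical and chosen only for the example; check the documentation of the model you use for the real values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordTiming:
    word: str
    start: float     # seconds from the beginning of the audio
    duration: float  # seconds


def word_timings(frame_labels: List[str], frame_duration: float,
                 blank: str = "_", space: str = " ") -> List[WordTiming]:
    """Collapse per-frame CTC labels into words with start time and duration.

    frame_labels: one label per output frame (hypothetical input format).
    frame_duration: seconds of audio covered by one frame (assumed known).
    """
    words: List[WordTiming] = []
    chars: List[str] = []
    start_frame = end_frame = 0
    prev = blank

    def flush() -> None:
        # Turn the characters collected so far into one timed word.
        words.append(WordTiming(
            word="".join(chars),
            start=start_frame * frame_duration,
            duration=(end_frame - start_frame + 1) * frame_duration,
        ))
        chars.clear()

    for i, label in enumerate(frame_labels):
        if label == space:
            if chars:
                flush()          # a space closes the current word
        elif label != blank:
            if label != prev:    # CTC rule: repeated labels count once
                if not chars:
                    start_frame = i
                chars.append(label)
            end_frame = i        # last frame on which the word was active
        prev = label
    if chars:
        flush()                  # last word has no trailing space
    return words


# Illustrative frame labels, not real model output.
frames = list("__hh_i__  _yyo_u_")
for t in word_timings(frames, frame_duration=0.04):
    print(f"{t.word} start={t.start:.2f}s duration={t.duration:.2f}s")
```

With the made-up input above this prints a start time and duration for "hi" and "you". Timestamps obtained this way are only as precise as the frame size, and CTC models tend to emit a character somewhat after it is spoken, so treat the values as approximate.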

Best of luck with your project, and welcome to the NVIDIA Developer Community.