I am trying this for the first time.
Basically, what I want to achieve is to transcribe
a WAV file to text. I have achieved this, but I am also interested in getting
the metadata:
at which second does each word start?
This feature is implemented in
DeepSpeech and Vosk. Is there something similar in
NVIDIA NeMo?
Maybe I missed something.
NeMo is a framework for building applications that could do what you describe; it is not a ready-to-run application.
You can find more details about NeMo, along with tutorials on how to build applications and use some of the pre-trained models, on our developer site: NVIDIA NeMo | NVIDIA Developer
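To give a rough idea of what such an application has to compute, here is a minimal sketch of turning per-frame character labels (as a greedy CTC-style decoder might produce them) into word start and end times. This is plain Python, not NeMo's API: the function name `word_timestamps`, the `"_"` blank symbol, and the `seconds_per_frame` value are all assumptions for illustration, and the real frame duration depends on the model you use.

```python
from typing import List, Tuple

BLANK = "_"  # assumed symbol for the CTC blank (hypothetical)
SPACE = " "  # assumed word separator label


def word_timestamps(
    frame_labels: List[str], seconds_per_frame: float
) -> List[Tuple[str, float, float]]:
    """Return (word, start_seconds, end_seconds) for each decoded word.

    frame_labels holds one label per audio frame; repeated labels are
    collapsed and blanks are dropped, as in greedy CTC decoding.
    """
    words = []
    chars = []   # characters of the word currently being built
    start = 0    # index of the first frame of the current word
    end = 0      # index one past the last frame of the current word
    prev = BLANK
    for i, label in enumerate(frame_labels):
        if label not in (BLANK, SPACE):
            if label != prev:        # a new character, not a repeat
                if not chars:
                    start = i        # first character marks the word start
                chars.append(label)
            end = i + 1              # extend the word over this frame
        elif label == SPACE and chars:
            # A space closes the current word.
            words.append(("".join(chars), start * seconds_per_frame,
                          end * seconds_per_frame))
            chars = []
        prev = label
    if chars:                        # flush the last word
        words.append(("".join(chars), start * seconds_per_frame,
                      end * seconds_per_frame))
    return words


if __name__ == "__main__":
    # Made-up frame labels, only to show the call.
    frames = ["_", "h", "h", "_", "i", "_", " ", "_", "y", "o", "o", "_"]
    for word, t0, t1 in word_timestamps(frames, seconds_per_frame=0.25):
        print(word, t0, t1)
```

In a real pipeline the frame labels would come from the acoustic model's output, and the frame duration from the model's feature window stride and any downsampling it applies.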
Best of luck with your project, and welcome to the NVIDIA Developer Community.