Hello,
I am trying NVIDIA NeMo for the first time.
Basically, what I want to achieve is to transcribe a WAV file to text. I have achieved this, but I am also interested in getting metadata.
For instance, at which second does each word start?
This feature is implemented in DeepSpeech and Vosk. Is there something similar in NVIDIA NeMo?
Maybe I missed something.
Thanks!
NeMo is a framework for building applications that could do what you describe; it is not a ready-to-run application.
You can find more details about NeMo, along with tutorials on how to build applications and use some of the pre-trained models, on our developer site: https://developer.nvidia.com/nvidia-nemo
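As a rough illustration of how an application could derive word start times from a speech model, here is a minimal sketch in plain Python. It does not use any NeMo calls. It assumes a character-level CTC model that emits one label per audio frame, and the blank symbol `_`, the function name `word_start_times`, and the frame duration are hypothetical choices for this example only.

```python
from typing import List, Tuple

BLANK = "_"  # hypothetical symbol standing in for the CTC blank token


def word_start_times(frame_labels: List[str], frame_duration: float) -> List[Tuple[str, float]]:
    """Collapse per-frame CTC labels into (word, start time in seconds) pairs."""
    words: List[Tuple[str, float]] = []
    current = ""          # characters of the word being built
    start_frame = None    # frame index where the current word began
    previous = BLANK
    for i, label in enumerate(frame_labels):
        # CTC collapse rule: keep a label only if it is not blank
        # and differs from the label of the previous frame.
        if label != BLANK and label != previous:
            if label == " ":
                # A space closes the current word, if there is one.
                if current:
                    words.append((current, start_frame * frame_duration))
                current, start_frame = "", None
            else:
                if start_frame is None:
                    start_frame = i
                current += label
        previous = label
    # Flush the last word when the audio does not end with a space.
    if current:
        words.append((current, start_frame * frame_duration))
    return words


# Assumed example: 10 frames of 0.5 s each.
frames = ["h", "h", "_", "i", "_", " ", "_", "y", "o", "o"]
print(word_start_times(frames, 0.5))
```

The start time of each word is simply the index of the frame where its first character was emitted, multiplied by the frame duration. The real frame duration depends on the model's feature stride and any downsampling, so it would have to be taken from the model you actually use.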
Best of luck with your project, and welcome to the NVIDIA Developer Community.