I use a Jasper15x5dr model for inference with the Nvidia Nemo framework. The results are very reliable; however, the last part of the transcription is always cut off.
Environment used:
→ python 3.8
→ cudatoolkit=10.1
→ nemo-toolkit=10.0
→ additionally installed nemo_toolkit[asr]
OS used:
→ Windows 10 Professional x64 (locally)
→ Linux serverside
What did I attempt to do (quick summary)?
→ Use Nvidia Nemo ASR Jasper 15x5dr model to perform inference on recorded wav files.
What is the problem?
→ The last section of the transcription is always missing. When the next audio file is transcribed, the cut-off part appears at the beginning of its transcription.
Is there another similar topic/issue being reported here on the Nvidia forums?
→ no
I am using the Nvidia Nemo Jasper 15x5dr model to transcribe recorded audio files (the input formats differ, but all are converted to wav at 16 kHz, 1 channel, 16 bit in the preprocessing stage). For inference, I based the implementation on the Online ASR Microphone Demo provided in the official Nvidia Nemo GitHub repository (https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/2_Online_ASR_Microphone_Demo.ipynb). Instead of using PyAudio to transcribe incoming audio streams on the fly, I receive base64-encoded, preprocessed wav files, but I use the same underlying logic as the transcribe method of the FrameASR class. I use only a slight overlap: too much overlap produced repeated phonemes in the transcription, whereas a very slight overlap when feeding in the wav files gives me excellent results (apart from the cut-off at the end).
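To illustrate the input side, here is a minimal sketch of how a base64-encoded wav payload can be turned into a numpy array of samples. It is not the actual service code; the function name `decode_wav_base64` is made up for this post, and it assumes the payload is a complete wav file (header included) that has already been converted to 16 kHz, mono, 16 bit.

```python
import base64
import io
import wave

import numpy as np


def decode_wav_base64(payload: str) -> np.ndarray:
    """Decode a base64-encoded wav file into an array of int16 samples.

    Hypothetical helper: assumes 16 kHz, mono, 16-bit PCM input.
    """
    raw = base64.b64decode(payload)
    with wave.open(io.BytesIO(raw), "rb") as wav:
        # Reject files that did not go through the preprocessing stage.
        if (wav.getframerate(), wav.getnchannels(), wav.getsampwidth()) != (16000, 1, 2):
            raise ValueError("expected 16 kHz, mono, 16-bit PCM")
        pcm = wav.readframes(wav.getnframes())
    # wav PCM data is little-endian, so the dtype is spelled out explicitly.
    return np.frombuffer(pcm, dtype="<i2")
```

The resulting array is what gets cut into frames and handed to the transcribe logic.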
Before the signal (i.e. the numpy array built from the incoming wav file) is passed to the loaded model for inference in the _decode method, the buffer is shifted to the left to make room for the next frame at the very end. When I debug and inspect the buffer and the wav files, everything looks correct. The model always receives the current frame (a small portion of the current wav file) plus some overlapping history (the buffer retains some of the preceding audio).
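The buffering described above can be sketched roughly as follows. This is a simplified illustration of the shift-and-append step as I understand it from the notebook, not the actual service code; the names `push_frame` and `iter_frames` are placeholders, and zero-padding the final short frame is my assumption about how a recorded file should be split.

```python
import numpy as np


def push_frame(buffer: np.ndarray, frame: np.ndarray) -> np.ndarray:
    """Shift the buffer left by one frame and write the new frame at the end."""
    n = len(frame)
    buffer[:-n] = buffer[n:]   # drop the oldest samples, keep the history
    buffer[-n:] = frame        # newest frame sits at the very end
    return buffer


def iter_frames(signal: np.ndarray, n_frame_len: int):
    """Yield fixed-size frames; the last one is zero-padded if it is short."""
    for start in range(0, len(signal), n_frame_len):
        frame = signal[start:start + n_frame_len]
        if len(frame) < n_frame_len:
            frame = np.pad(frame, (0, n_frame_len - len(frame)), "constant")
        yield frame


# Example: a buffer holding three frames of two samples each.
buffer = np.zeros(6, dtype=np.float32)
for frame in iter_frames(np.arange(1, 6, dtype=np.float32), 2):
    push_frame(buffer, frame)
    # At this point the whole buffer would be handed to the model.
```

In this sketch the model would see the newest frame together with the two preceding frames on every step, which matches the "current frame plus history" behaviour I observe when debugging.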
Does anyone use one of the Nvidia Nemo models for inference on recorded audio? If so, did you base your implementation on the concept provided in this example notebook? Did you change any of the buffering? I am hesitant to change this part, since I assume the framework performs best this way and that the buffering was designed like this intentionally. I have tried many audio files, and all show exactly the same behavior. The implementation I am using is part of a service and not publicly available, but if it would help, I can provide the relevant parts of the implementation.
Any hint/suggestion is highly welcome!