Nvidia Nemo cuts last part of transcription

Fairwell · June 23, 2020, 2:25pm

I use a Jasper15x5dr model for inference with the Nvidia Nemo framework. The results are very reliable; however, the last part of the transcription is always cut off.

Environment used:
→ python 3.8
→ cudatoolkit=10.1
→ nemo-toolkit=10.0
→ additionally installed nemo_toolkit[asr]

OS used:
→ Windows 10 Professional x64 (locally)
→ Linux (server side)

What did I attempt to do (quick summary)?
→ Use Nvidia Nemo ASR Jasper 15x5dr model to perform inference on recorded wav files.

What is the problem?
→ The last section of the transcription is always missing. When the next audio file is transcribed, the cut-off part appears at the beginning of its transcription.

Is there another similar topic/issue being reported here on the Nvidia forums?
→ no

I am using the Nvidia Nemo Jasper 15x5dr model to transcribe recorded audio files (different input formats, but all are converted to 16 kHz, 1-channel, 16-bit wav in the preprocessing stage). For inference, I based the implementation on the Online ASR Microphone Demo provided in the official Nvidia Nemo GitHub repository (https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/2_Online_ASR_Microphone_Demo.ipynb). Instead of using PyAudio to transcribe incoming audio streams on the fly, I receive base64-encoded, preprocessed wav files, but I use the same underlying logic as the transcribe method of the FrameASR class. I use a slight overlap: too much overlap produced repeated phonemes in the transcription, whereas a very slight overlap when feeding in the wav files gives excellent results (apart from the cut-off at the end).
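To make the input side concrete, here is a minimal sketch of how a base64-encoded wav payload could be turned into fixed-length frames. It is not the service code: the function names `decode_wav_b64` and `split_into_frames` and the frame length are hypothetical, and it assumes the payload is already 16-bit mono, as described above.

```python
import base64
import io
import wave

import numpy as np


def decode_wav_b64(payload: str) -> np.ndarray:
    """Decode a base64-encoded 16-bit mono wav payload into an int16 array."""
    raw = base64.b64decode(payload)
    with wave.open(io.BytesIO(raw), "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono wav data")
        pcm = wav.readframes(wav.getnframes())
    # wav PCM data is little-endian, so the dtype is spelled out explicitly.
    return np.frombuffer(pcm, dtype="<i2")


def split_into_frames(signal: np.ndarray, frame_len: int) -> list:
    """Split a signal into equal-length frames, zero-padding the last one."""
    pad = (-len(signal)) % frame_len
    padded = np.pad(signal, (0, pad))
    return [padded[i:i + frame_len] for i in range(0, len(padded), frame_len)]
```

In this sketch the final frame is padded with zeros, so the end of the recording is never dropped before it reaches the buffer.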

Before the signal (i.e. the numpy array built from the incoming wav file) is passed to the loaded model in the _decode method, the buffer is shifted to the left to make room for the next frame at the very end. When I debug and inspect the buffer and the wav files, everything seems correct. The model always receives the current frame (a small portion of the current wav file) plus some overlapping history (the buffer retains some of the preceding audio).
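The shift-left buffering described above can be sketched roughly as follows. This is a simplified numpy illustration, not the notebook's actual code: the class name `RollingBuffer` and its parameters are hypothetical, and the model call is left out entirely.

```python
import numpy as np


class RollingBuffer:
    """Fixed-size buffer: past context on the left, newest frame on the right."""

    def __init__(self, frame_len: int, n_frames: int):
        self.frame_len = frame_len
        self.data = np.zeros(frame_len * n_frames, dtype=np.float32)

    def push(self, frame: np.ndarray) -> np.ndarray:
        """Shift the buffer left by one frame and append the new frame."""
        if len(frame) != self.frame_len:
            raise ValueError("frame has the wrong length")
        # The oldest frame drops out on the left ...
        self.data[:-self.frame_len] = self.data[self.frame_len:]
        # ... and the new frame takes the freed slot at the very end.
        self.data[-self.frame_len:] = frame
        # Return a copy: this is the window the model would receive.
        return self.data.copy()
```

With this layout, the last real frame of a file always sits at the right edge of the final window. Pushing one or more all-zero frames after it would move that audio further left in the buffer; whether doing so recovers the missing text is an assumption that would need testing.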

Does anyone use one of the Nvidia Nemo models for inference on recorded audio? If so, did you base your implementation on the concept provided in this example notebook? Did you change any of the buffering? I am hesitant to change this part, since I assume it was designed this way intentionally and that the framework performs best like this. I have tried many audio files, and all show exactly the same behavior. The implementation I am using is part of a service and is not publicly available, but if it would help, I can provide the relevant parts.

Any hint/suggestion is highly welcome!
