AWS g4dn.xlarge T4 16GiB
I'm curious what exactly changed between “streaming offline mode” and “true offline mode” in the 1.8 beta release, and why. Can I recreate the previous offline mode (prior to 1.8 beta) by simply using the streaming inference model with similar chunk sizes?
Is this change in batch processing documented? I did not see it in the release notes.
@rleary Could you offer some insight, please? Thanks to your previous assistance we were able to get really nice long-form transcriptions with the increased gRPC message sizes and the older “streaming offline mode”.
09:58:40.630826 346 grpc_riva_asr.cc:447] ASRService.Recognize called.
09:58:40.660948 346 riva_asr_stream.cc:213] Detected format: encoding = 1 numchannels = 1 samplerate = 16000 bitspersample = 16
09:58:40.662431 346 grpc_riva_asr.cc:519] ASRService.Recognize performing streaming recognition with sequence id: 31625700
09:58:40.662487 346 grpc_riva_asr.cc:537] Using model citrinet-1024-en-US-asr-offline for inference
09:58:40.662544 346 grpc_riva_asr.cc:552] Model sample rate= 16000 for inference
09:58:40.662636 346 grpc_riva_asr.cc:583] Error: Audio duration (1376.962524s) is longer than maximum supported audio duration in offline mode (900.0s)
Notice also that the log here says "ASRService.Recognize performing streaming recognition with sequence id: ".
If I'm reading this right, that should now say offline recognition, correct?
Hi @ShantanuNair. Sorry for the trouble this has caused you - I agree we can make the release notes more clear about this change. I’ll get that updated. Good catch on the log message as well.
We did, indeed, change the behavior of the offline API in this release. With the Jasper/QuartzNet/CitriNet model family, there is always at least some accuracy degradation when performing streaming inference (including the previous ‘streaming offline’ implementation). In some cases after fine-tuning, this degradation can be drastic - we continue to research this behavior. By switching to true batch processing, we recover this lost accuracy. In a future release, we will also be able to increase the throughput of the offline endpoint due to these changes.
Depending on your deployment configuration (number of models, GPU memory), you may be able to increase the maximum input size by regenerating the RMIR and re-deploying. We documented the procedure to replicate our model deployments in this release: Riva — NVIDIA Riva. Note the CitriNet Offline portion of the table, specifically:
--chunk_size=900 --left_padding_size=0. --right_padding_size=0.
You can modify chunk_size to meet your requirements, assuming the model will continue to fit in memory. We will be looking into opportunities to minimize the disruption in future releases as well (e.g. running VAD and splitting the input audio if necessary).
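In the meantime, one possible client-side workaround is to split long recordings into pieces that stay under the offline limit before sending them. The following is only a minimal sketch using Python's standard wave module; it is not part of Riva, the names plan_segments and split_wav are hypothetical, and the 900 s default is simply the limit reported in the server log above. It cuts at fixed boundaries, so a word can be split across two pieces; a VAD-based splitter would choose better cut points.

```python
import wave

MAX_OFFLINE_SECONDS = 900  # assumption: the limit reported in the server log


def plan_segments(total_frames, max_frames):
    """Return (start, end) frame ranges, each at most max_frames long."""
    if total_frames < 0 or max_frames <= 0:
        raise ValueError("total_frames must be >= 0 and max_frames must be > 0")
    segments = []
    start = 0
    while start < total_frames:
        end = min(start + max_frames, total_frames)
        segments.append((start, end))
        start = end
    return segments


def split_wav(path, max_seconds=MAX_OFFLINE_SECONDS):
    """Yield (start_seconds, pcm_bytes) for consecutive pieces of a WAV file."""
    with wave.open(path, "rb") as src:
        rate = src.getframerate()
        max_frames = int(max_seconds * rate)
        for start, end in plan_segments(src.getnframes(), max_frames):
            src.setpos(start)  # seek to the first frame of this piece
            yield start / rate, src.readframes(end - start)
```

For the 1376.96 s, 16 kHz file in the log above, this would produce two pieces: the first 900 s and the remaining ~477 s. Each piece holds raw PCM frames, so it still needs a WAV header (or an explicit encoding and sample rate in the request config) before it is sent to the offline endpoint, and the transcripts have to be concatenated afterwards.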
Hope this helps. Please let me know if there’s any other information we can provide to assist you with your deployment.