GPU-Accelerated Speech to Text with Kaldi: A Tutorial on Getting Started

Originally published at:

Recently, NVIDIA achieved GPU-accelerated speech-to-text inference with exciting performance results. That blog post described the general process of the Kaldi ASR pipeline and indicated which of its elements the team accelerated, i.e. implementing the decoder on the GPU and taking advantage of Tensor Cores in the acoustic model. Now with the latest Kaldi container on…

I would like to see a link to an article which describes what is needed to use the model in real time.

Do you mean as in streaming audio in real time? How many streams of audio would you have? This is something we are currently working on.

I'm also interested in the above, especially for voice-related home automation.

In the WAV format, shouldn't it be 16-bit instead of 32-bit float?