GPU-Accelerated Speech to Text with Kaldi: A Tutorial on Getting Started

Originally published at:

Recently, NVIDIA achieved GPU-accelerated speech-to-text inference with exciting performance results. That blog post described the general process of the Kaldi ASR pipeline and indicated which of its elements the team accelerated, i.e. implementing the decoder on the GPU and taking advantage of Tensor Cores in the acoustic model. Now with the latest Kaldi container on…

I would like to see a link to an article which describes what is needed to use the model in real time.

Do you mean as in streaming audio in real time? How many streams of audio would you have? This is something we are currently working on.


I'm also interested in this topic, especially for voice-related home automation.

In the WAV format, shouldn't it be 16-bit integer PCM instead of 32-bit float?
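As a quick way to settle that kind of question, a sketch like the one below can inspect a file's `fmt ` chunk (format code 1 = integer PCM, 3 = IEEE float) and downconvert float samples to 16-bit integers. This is an illustrative, stdlib-only sketch assuming a standard RIFF layout; the helper names `wav_format` and `float32_to_int16` are my own, not part of Kaldi.

```python
import struct

def wav_format(path):
    """Return (audio_format, bits_per_sample) from a WAV file's fmt chunk.

    audio_format 1 = integer PCM, 3 = IEEE float. Many ASR pipelines
    expect 16-bit integer PCM, so a (3, 32) file would need converting.
    """
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            hdr = f.read(8)
            if len(hdr) < 8:
                raise ValueError("no fmt chunk found")
            chunk_id, size = struct.unpack("<4sI", hdr)
            data = f.read(size)
            if chunk_id == b"fmt ":
                audio_format, _, _, _, _, bits = struct.unpack("<HHIIHH", data[:16])
                return audio_format, bits

def float32_to_int16(samples):
    """Scale [-1.0, 1.0] float samples to 16-bit ints, clamping overflow."""
    return [max(-32768, min(32767, int(s * 32767))) for s in samples]
```

For example, `wav_format("audio.wav")` returning `(3, 32)` would indicate a 32-bit float file that should be converted before feeding it to a 16-bit PCM pipeline.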

Hi, I would like to know if the real-time streaming option is out yet. If not, when will it be supported?


Yes, streaming is now fully supported. You can find more details here:


I'm assuming this forum is appropriate for discussing Kaldi implementation issues; if not, I apologize.

I hit a roadblock when trying to use Kaldi on a corpus of English-Spanish language data, using this code, which seems to be tailored to Chinese.
More details are in this status report; the paragraph "discontinuing the project" explains the data preparation issue. I would appreciate any help on this.