Speech recognition challenge

We’re looking for GPGPU-based speech recognition technology for an unusual application.

One of our customers has an audio archive comprising almost 50,000 hours of material, almost all of which is from one speaker, over a 40+ year period.

The speaker uses a rather limited English vocabulary, around 1200 words (according to one hour-long sample), along with roughly 100 specialized non-English words.
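
For anyone who wants to reproduce that kind of vocabulary count from a transcript, here is a minimal sketch. The tokenization rule (case-folded runs of letters, with internal apostrophes kept) and the `vocabulary` helper are assumptions made for illustration, and the sample sentence is made up; the figures above did not necessarily come from this method.

```python
import re
from collections import Counter


def vocabulary(transcript: str) -> Counter:
    """Count case-folded word tokens in a transcript.

    Assumption: a "word" is a run of letters, optionally joined by
    apostrophes (so "isn't" counts as one word). Digits and punctuation
    are ignored.
    """
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", transcript.lower())
    return Counter(tokens)


# Made-up sample text, only to show the call.
sample = "The mind is calm. The mind is clear, and it isn't restless."
counts = vocabulary(sample)
print(len(counts), "distinct words")
print(counts.most_common(3))
```

Running the same function over several hours of transcripts, rather than one, would show whether the vocabulary really levels off near the size seen in the sample.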

The goal is to generate transcripts that are close enough that a human editor, familiar with the specialized vocabulary and speech of the lecturer, can “touch up” the transcripts in, say, 30% of real time.
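
To put that target in perspective, here is a rough back-of-envelope sketch of the editing workload. The archive size and the 30% figure come from this post; the 1,600 working hours per editor per year is purely an assumption for illustration, and the function names are made up.

```python
ARCHIVE_HOURS = 50_000          # from the post: almost 50,000 hours of audio
EDIT_FRACTION = 0.30            # from the post: touch-up in about 30% of real time
EDITOR_HOURS_PER_YEAR = 1_600   # assumption: one full-time editor


def editing_hours(archive_hours: float, edit_fraction: float) -> float:
    """Human editing time needed for the whole archive, in hours."""
    return archive_hours * edit_fraction


def editor_years(archive_hours: float, edit_fraction: float,
                 hours_per_year: float) -> float:
    """The same workload expressed in full-time editor years."""
    return editing_hours(archive_hours, edit_fraction) / hours_per_year


total = editing_hours(ARCHIVE_HOURS, EDIT_FRACTION)
years = editor_years(ARCHIVE_HOURS, EDIT_FRACTION, EDITOR_HOURS_PER_YEAR)
print(f"Editing time: about {total:,.0f} hours")
print(f"Roughly {years:.1f} editor-years at {EDITOR_HOURS_PER_YEAR} hours/year")
```

Under these assumptions the touch-up pass alone is on the order of 15,000 hours, so even small gains in recognition accuracy translate into a lot of saved editing time.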

There are two uses of the transcripts: firstly, for closed captions, and secondly, as input for subsequent semantic analysis and abstraction of topics.

The audio quality, from the human perspective, is quite good, but the recordings are made in a variety of venues, using different microphones and so forth. Most of the lectures were given in small to medium-sized rooms, with perhaps 50-200 persons in the audience. Fortunately, very little was recorded outdoors.

Presumably the technology allows the creation of custom acoustic, speech and language models, all optimized for this one speaker. The inputs, so to speak, are the audio files, with manually prepared, accurate transcripts as training materials.

We’d be interested in hearing from researchers active in this field, and exploring development of what’s required to accomplish this.

I don’t think CUDA is strictly required for this job.

A lot of CPU-based voice recognition toolkits are doing pretty well and have been on the market for more than 10 years, being constantly improved over time. A couple of commercial solutions are available, and you should be able to train such software for this particular speaker and his particular vocabulary.

However, if such a training sequence requires the speaker to repeat particular phrases, then you might have a problem - assuming that this person is no longer available. ;)

The problem is that commercial applications, at least at the consumer level, are very sensitive to the speaker, acoustical background, microphone and so forth. And they aren’t fast enough to work at large scale.

Exactly. The “input” materials are the audio files and the transcriptions; that’s it. We can’t go back to the original speaker.

You could ask these guys:
They have a noise-robust and reasonably fast speech recognition system (used in a commercial setting by www.soundintel.com).
It is not GPU-based, although their system could be recoded for GPGPU. I know this because I helped develop their early code base ;-)

I’m also an alumnus here ;-)

Who’s the best person to contact there? Sounds like a good resource.

I think Tjeerd Andringa is the key member here. He is the program director and has been a researcher in this field for many years:


And with a CC to Professor Dr. Schomaker: