We’re looking for GPGPU-based speech recognition technology for an unusual application.
One of our customers has an audio archive comprising almost 50,000 hours of material, almost all of which is from one speaker, over a 40+ year period.
The speaker uses a rather limited English vocabulary of around 1,200 words (based on one hour-long sample), along with roughly 100 specialized non-English words.
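The vocabulary estimate above comes from a one-hour sample. A back-of-envelope way to reproduce that kind of count is to tally distinct word types in a transcript; the tokenization below (lowercased, alphabetic runs plus apostrophes) is an illustrative assumption, not part of the original inquiry:

```python
# Sketch: estimate vocabulary size from a transcript sample by counting
# distinct word types. Tokenization rule is an assumption for illustration.
import re
from collections import Counter

def vocabulary_stats(transcript_text):
    """Return (total token count, distinct word-type count) for a transcript."""
    # Lowercase and keep alphabetic runs (plus apostrophes), so "Word," and
    # "word" collapse to the same type.
    tokens = re.findall(r"[a-z']+", transcript_text.lower())
    return len(tokens), len(Counter(tokens))

tokens, types = vocabulary_stats("The speaker repeats the same words: the words repeat.")
print(tokens, types)  # 9 tokens, 6 distinct types
```

On a real hour-long transcript this gives a quick sanity check of the ~1,200-word figure before committing to a lexicon.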
The goal is to generate transcripts close enough that a human editor, familiar with the lecturer’s specialized vocabulary and speech, can “touch up” the transcripts in, say, 30% of real time.
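The 30%-of-real-time target implies a large but bounded editing effort for a 50,000-hour archive. A rough sketch of the arithmetic, where the productive-hours-per-editor-year figure is purely our assumption:

```python
# Back-of-envelope editing workload implied by the stated targets.
ARCHIVE_HOURS = 50_000          # size of the audio archive (from the inquiry)
EDIT_RATIO = 0.30               # target: editing takes 30% of audio duration
HOURS_PER_EDITOR_YEAR = 1_800   # assumed productive editing hours per year

editor_hours = ARCHIVE_HOURS * EDIT_RATIO
editor_years = editor_hours / HOURS_PER_EDITOR_YEAR
print(editor_hours, round(editor_years, 1))  # 15000.0 editor-hours, ~8.3 editor-years
```

Even at the target accuracy, this is on the order of 15,000 editor-hours, which is why transcript quality straight out of the recognizer matters so much.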
The transcripts have two uses: firstly, closed captions, and secondly, input for subsequent semantic analysis and abstraction of topics.
The audio quality is, to the human ear, quite good, but the recordings were made in a variety of venues, using different microphones and so forth. Most of the lectures were given in small to medium-sized rooms, with perhaps 50-200 persons in the audience. Fortunately, very little was recorded outdoors.
Presumably the technology allows creation of custom acoustic, pronunciation, and language models, all optimized for this one speaker. The inputs, so to speak, are the audio files, plus manually prepared, accurate transcripts as training material.
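Whatever toolkit is used, the first data-preparation step is pairing each audio file with its transcript. A minimal sketch, assuming one file per lecture with a shared base name and `.wav`/`.txt` extensions (both layout and extensions are illustrative assumptions):

```python
# Sketch: assemble (audio, transcript) training pairs by shared base name.
# Directory layout and file extensions are assumptions for illustration.
from pathlib import Path

def pair_corpus(audio_dir, transcript_dir):
    """Return {stem: (audio_path, transcript_path)} for lectures present in both dirs."""
    audio = {p.stem: p for p in Path(audio_dir).glob("*.wav")}
    texts = {p.stem: p for p in Path(transcript_dir).glob("*.txt")}
    # Keep only lectures that have both an audio file and a transcript;
    # anything unmatched should be flagged for manual review, not trained on.
    return {stem: (audio[stem], texts[stem])
            for stem in sorted(audio.keys() & texts.keys())}
```

Unmatched files on either side (audio without a verified transcript, or vice versa) would be held out of training until resolved.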
We’d be interested in hearing from researchers active in this field, and in exploring the development work required to accomplish this.