I never quite got around to this, but I did some preliminary testing and found that decoding the MP3 actually accounted for something like 99% of the function's total runtime (in the CPU version of libofa, linked below).
However, I think it might be possible to decode the whole thing in parallel on the GPU if you first do a quick reduction/scan of the file (or otherwise read the metadata to determine its total length). If you knew the output length, you could calculate how much memory to allocate, since each block decodes to the same amount of data. Using the scan/reduction step, you could build a mapping from each block in the encoded data to the memory location for that block's decoded data. From there, you could have each thread decode a single block.
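To make the scan step concrete, here is a minimal sketch in Python (the real version would be a CUDA kernel, but the mapping logic is the same). It assumes MPEG-1 Layer III framing, where every frame decodes to a fixed 1152 samples and the encoded frame length is `144 * bitrate / samplerate + padding` bytes; it also glosses over the bit reservoir, which in practice keeps frames from being fully independent. The function names and the example frame parameters are made up for illustration.

```python
# Sketch of the scan step: walk the encoded stream once to find each frame's
# byte offset, then map frame i to a fixed-size slot in the decoded output.
# Constants follow MPEG-1 Layer III; everything else here is illustrative.
SAMPLES_PER_FRAME = 1152  # fixed decoded size per MPEG-1 Layer III frame

def frame_byte_size(bitrate_bps, samplerate_hz, padding):
    # Encoded frame length in bytes for MPEG-1 Layer III
    return 144 * bitrate_bps // samplerate_hz + padding

def build_frame_map(frame_sizes):
    """Exclusive prefix sum over encoded frame sizes.

    Returns one (input_byte_offset, output_sample_offset) pair per frame:
    where the frame's encoded bytes start, and where its decoded samples go.
    Because every frame decodes to the same number of samples, the output
    offset is just i * SAMPLES_PER_FRAME, so each GPU thread i could then
    decode its frame independently into a known location.
    """
    offsets = []
    in_off = 0
    for i, size in enumerate(frame_sizes):
        offsets.append((in_off, i * SAMPLES_PER_FRAME))
        in_off += size
    return offsets

# Example: a short VBR-ish stream with varying per-frame byte sizes.
sizes = [frame_byte_size(br, 44100, pad)
         for br, pad in [(128000, 0), (192000, 1), (160000, 0)]]
fmap = build_frame_map(sizes)  # [(0, 0), (417, 1152), (1044, 2304)]
```

On the GPU, the sequential loop in `build_frame_map` would become a parallel exclusive scan over the frame sizes, which is exactly the reduction/scan step described above.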
This is pretty high up on my 'free time to-do list', since I think it could give a substantial speedup to any program that needs to decode MP3 data. Even things like Winamp, iTunes, and so forth could benefit, since they could immediately decode an entire song/recording into memory, which makes seeking even faster. Also, the whole reason I wanted to do this project was to write a CUDA-enabled equivalent of the libofa library, to use with the MusicBrainz service. It's also something that could easily use streaming when you're fingerprinting a lot of files, and it would scale very well to multiple GPUs (each GPU working on one file at a time).
I actually just downloaded some open-source code for various MPEG decoders, so maybe I'll give it a shot this weekend.