MP3 DEcoder?

I know nVidia had the coding contest before with the CUDA-enabled LAME encoder…but I’m playing with an application now that needs to do MP3 decoding, and given that (from what I’ve read) all MP3 decoders should produce the same output, I wondering if anyone had implemented a CUDA-accelerated MP3 decoder.

If not, does anyone know if this would even be possible/worthwhile?

MP3 decoding is pretty quick as it is… and people usually only need to decode in real time. What’s this for?

Signal analysis on audio data…I’ve got some code that analyzes audio data in a wav file and computes some statistics on it. However, as the program runs now, it has to use LAME to decode the MP3 file to WAV first before the analysis can be run. I’m currently porting the analysis code to CUDA which I think will speed up the process significantly, but I thought that if I could also decode the MP3 on the device, I could get a better speedup since I would need to transfer less data to the card for each file.

I checked out the MP3 format on wikipedia. Apparently it’s encoded as a series of frames, but the frames have to be decoded sequentially (state is kept). I suppose you could use a whole block to cooperatively decode a frame, but there’s not enough meat to split the task across a whole GPU. Easiest thing would be to keep the code single-threaded and decode a few thousand files at once. For you, this may make sense. (I don’t think anyone else would need it, but… :) )

It probably wouldn’t be that hard to take single-threaded CPU source code and convert it to CUDA in that embarassingly parallel way. I’m actually doing something similar with another audio codec.

If this statement were true, there would be no skipping or random access possible when replaying an MP3 file.

I don’t think it’s sequential for the entire file, but sequential for a block of frames. Thus, the block needs to be decoded serially, but the blocks themselves might be able to be decoded in parallel.

alex, I think you’re right about just doing the files that way…what I might do is read a group of files (say, 10-20) at a time onto the card, and decode them as needed for processing. Perhaps that could hide a bit of the transfer latency from host->device.

Good point. Ok, I don’t know how MP3s work. But wikipedia said this: “This sequence of frames is called an elementary stream. Frames are not independent items (“byte reservoir”) and therefore cannot be extracted on arbitrary frame boundaries.”

I don’t know if there’s some unit bigger than the frame that is independent, of if winamp just starts off with a few guessed parameters, plays through a few frames until the errors stabilize, and turns on the audio.

Well, in the trivial CUDA implementation, you’d read in a thousand files (btw, are they all the same length?), decode them all in one kernel run, storing the PCM in the 1GB of DRAM that you have, and then slowly use them.

I need to do CUDA-accelerated MP3 decoder too.
how about your project ? what can be offload to GPU?

I never quite got around to this, but I did a bit of preliminary testing, and I found that decoding the MP3 actually took something like 99% of the computational time for the function to run (the CPU version of libofa, linked below).

However, I think that it might be possible to decode the whole thing in parallel, on the GPU, if you do some kind of quick reduction/scan of the file (or you can otherwise read the metadata to determine the total length of the file). If you knew the output length, you can calculate the amount of memory to allocate (since each block uses the same amount of data once decoded). Using the scan/reduction step, you could determine a sort of mapping from the blocks in the encoded data to the memory location for each block’s decoded data. From there, you could have each thread decode a single block.

This is pretty high up on my ‘free time to-do list’, since I think it could give a substantial gain to any programs that need to decode mp3 data. Even things like Winamp, iTunes, and so forth could benefit since they could immediately decode the entire song/recording into memory, which makes seeking even faster. Also, the whole reason I wanted to do this project was to write a CUDA-enabled equivalent of the libofa library, to use with the MusicBrainz service. It is also something that could easily use streaming for when you are fingerprinting a lot of files, and will also scale very well to multiple GPU’s (each GPU could work on one file at a time).

I actually just downloaded some open-source code for various mpeg decoders, so maybe I’ll give it shot this weekend.

I think it is hard to do huffman decode parallelly on GPU.
I read mp3 decode code from ffmpeg and plan to import some module to GPU.
I think module like imdct can be run well.

yes, this is essentially equivalent to the fsm dwarf, and difficult to do.

I wrote a video compressor, which currently isn’t great performance, but the ideas are probably helpful. It is completely parallel (yes, everything). Unfortunately I don’t have time to make the project presentable, but I did get around to writing some documentation; I hope it’s helpful – And I got a research opportunity with something different, so it’s unlikely I’ll have much time to work on it further.

In a few years, I’d like to try to use GP (genetic programming) to evolve an efficient parallel entropy coding scheme. I currently don’t have much inspiration for this though; IM me [gatoatigrado at gmail] if you’d like to bounce ideas off me.


I’m also working on a hobby project that needs MP3 decoding, so if anyone wants to collaborate on a mp3 decoding library i’m down. As a preface, I’m relatively new to this (both mp3 decoding and cuda), so correct me if I’m wrong.

AFAIK, the byte reservoir just means that frame data may be shifted into the previous frame. This isnt really an issue because we have enough memory to just load an extra padding frame into the shared memory and have each frame look where the offset tells it to look.

Theres also the overlapping granules that everyone’s been hinting at, but I think each one only depends on the previous one and not the entire chain preceding it, so what we can do is just IMDCT into a double-length frame and then do an extra step to “stack” the parts that are supposed to overlap and sum them in parallel (or serial, if each thread does a frame).

It seems like the huffman coding is the divergent and unparallelizable portion. I haven’t really read the standard and verified this, but from what I gather the longest huffman code we have to deal with is only 4 bits long (read: lookup table), and the rest is just a bunch of shifts and additions, so it seems like this particular flavor of huffman coding is pretty constrained and can probably be parallelized quite nicely.