JPEG 2000 decoding with CUDA

Hi everyone.

I’m currently trying to add support for digital cinema packages (DCP) to one of my video playback tools. The DCP image data is a JPEG 2000 code stream at a resolution of 2048x1080 with 3 colour channels and 12-bit precision per channel.
Decoding on the CPU (Intel Xeon X5570) using the OpenJPEG library takes about 700 ms per frame, and I need to get that down to 40 ms or less.
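
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It only uses the figures quoted above; treating the decoded output as tightly packed 12-bit samples is an assumption made for illustration, not something the DCP format or OpenJPEG dictates.

```python
# Figures taken from the post above.
WIDTH, HEIGHT, CHANNELS, BITS = 2048, 1080, 3, 12
current_ms = 700.0   # measured OpenJPEG decode time per frame
target_ms = 40.0     # required decode time per frame

# How much faster the decoder has to become.
speedup_needed = current_ms / target_ms

# Frame rate that a 40 ms budget corresponds to.
target_fps = 1000.0 / target_ms

# Decoded data volume, assuming tightly packed 12-bit samples (illustrative only).
samples_per_frame = WIDTH * HEIGHT * CHANNELS
decoded_bytes_per_frame = samples_per_frame * BITS // 8
decoded_mb_per_second = decoded_bytes_per_frame * target_fps / 1e6

print(f"speedup needed: {speedup_needed:.1f}x")
print(f"decoded output: {decoded_mb_per_second:.1f} MB/s at {target_fps:.0f} fps")
```

So the gap to close is roughly a factor of 17 to 18, which is why a handful of CPU threads alone probably won't be enough.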

While there are quite a few papers and open source projects out there that focus on GPU-accelerated JPEG 2000 encoding, I couldn’t find anything on decoding. Even the commercial Kakadu JPEG 2000 SDK doesn’t seem to support GPU acceleration.
It also seems that (at least when using the OpenJPEG library) most of the time is NOT spent on the inverse discrete wavelet transform (IDWT) but rather on decoding the entropy-coded bit stream.

Before I embark on trying to implement something myself, does anyone know of a reason why this hasn’t been done before? (And if it has, could someone please point me in the right direction?)
Or is there some inherent property of the JPEG 2000 arithmetic coder that prevents parallelisation? (Maybe EN-coding was just the sexier research project ;) )

Thanks in advance to anyone that can help me.


I am not too familiar with JPEG 2000; however, I’ve done some research on draft versions of H.264 in the past (including the arithmetic coding).

In general, the state of an arithmetic coder changes with each decoded symbol. So it’s inherently serial. There might be a chance to parallelize decoding across independently coded elements of the picture (if there are any such independently decodable streams).
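
A minimal sketch of that idea, assuming the stream really can be split into independently coded blocks. The "coder" here is a made-up toy with an adaptive state, NOT the JPEG 2000 MQ coder or the H.264 one, and the names `encode_block`, `decode_block` and `decode_blocks_parallel` are hypothetical. It only shows the structure: the loop inside one block is serial, while separate blocks can be handed to separate workers.

```python
from concurrent.futures import ThreadPoolExecutor

def _next_state(state, symbol):
    # Toy adaptive state update (hypothetical, for illustration only).
    return (state * 31 + symbol + 1) & 0xFF

def encode_block(symbols):
    """Toy encoder: each output byte depends on all earlier symbols."""
    state, out = 0, bytearray()
    for s in symbols:
        out.append(s ^ state)
        state = _next_state(state, s)
    return bytes(out)

def decode_block(data):
    """Toy decoder: inherently serial, because symbol k cannot be
    recovered before the state left behind by symbols 0..k-1 is known."""
    state, out = 0, []
    for byte in data:
        s = byte ^ state
        out.append(s)
        state = _next_state(state, s)
    return out

def decode_blocks_parallel(blocks, workers=4):
    """Each block starts from a fresh state, so blocks are independent
    and can be decoded concurrently; order of results is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_block, blocks))

original = [[1, 2, 3], [200, 0, 7, 7], [255]]
encoded = [encode_block(b) for b in original]
decoded = decode_blocks_parallel(encoded)
print(decoded == original)
```

In Python the thread pool won't actually speed up CPU-bound work; it just stands in for whatever runs the blocks concurrently (one CUDA thread or thread block per coded block, for instance).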

Maybe parallelizing across multiple frames on one GPU could also be an option. Why decode frame by frame ;)
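
As a rough sketch of what frame-level parallelism would need (assuming every frame decodes independently and takes about the same time; `frames_in_flight` is just a name made up for this example):

```python
import math

def frames_in_flight(decode_ms, fps):
    """Number of frames that must be decoded concurrently so that
    one finished frame comes out every 1/fps seconds."""
    frame_interval_ms = 1000.0 / fps
    return math.ceil(decode_ms / frame_interval_ms)

# 700 ms per frame (figure from the first post) at two common frame rates.
for fps in (24, 25):
    n = frames_in_flight(700.0, fps)
    print(f"{fps} fps: {n} frames in flight, first frame still takes ~700 ms")
```

Note that this only fixes throughput. Each individual frame still takes the full decode time, so playback start and seeking would lag by that much unless the per-frame decode itself gets faster.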


Thanks for your help, Christian. I’m also still new to all this, but I’ll start by trying to identify any independent code blocks in the data stream.



What about JPEG or MJPEG decoding? Is it possible? I didn’t find much related to that…

Accelerating the IDCT is no problem.

The main issue preventing parallel decoding is that you don’t know where individual coded DCT blocks start in the data stream, unless synchronization markers were added during encoding. That pretty much rules out accelerated decoding of most JPEG images that were created with standard
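
To illustrate the point about synchronization markers: in baseline JPEG these are the restart markers RST0 to RST7 (bytes 0xFF 0xD0 through 0xFF 0xD7), and a literal 0xFF data byte inside the entropy-coded scan is followed by a stuffed 0x00. The sketch below assumes you have already extracted the raw scan data (everything after the SOS header); the function names are hypothetical and no real JPEG library is involved.

```python
def find_restart_markers(scan):
    """Return (offset, n) for every RSTn marker found in the scan bytes."""
    hits = []
    i = 0
    while i + 1 < len(scan):
        if scan[i] == 0xFF:
            nxt = scan[i + 1]
            if 0xD0 <= nxt <= 0xD7:      # RST0..RST7
                hits.append((i, nxt - 0xD0))
                i += 2
                continue
            if nxt == 0x00:              # byte-stuffed 0xFF data, not a marker
                i += 2
                continue
        i += 1
    return hits

def split_at_restart_markers(scan):
    """Cut the scan into the segments between restart markers.
    Each segment starts with a freshly reset decoder state."""
    segments, start = [], 0
    for offset, _ in find_restart_markers(scan):
        segments.append(scan[start:offset])
        start = offset + 2               # skip the two marker bytes
    segments.append(scan[start:])
    return segments

# Tiny synthetic scan: data, stuffed 0xFF, RST0, data, RST1, data.
scan = bytes([0x12, 0xFF, 0x00, 0x34, 0xFF, 0xD0, 0x56, 0xFF, 0xD1, 0x78])
markers = find_restart_markers(scan)
segments = split_at_restart_markers(scan)
print(markers, [s.hex() for s in segments])
```

If the encoder didn't emit any restart markers, `find_restart_markers` comes back empty and you are left with one long segment that has to be Huffman-decoded serially.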



Thanks, cbuchner1!

One more thing: the NVIDIA Performance Primitives (NPP) claim to support these kinds of operations (encoding/decoding) on JPEG files. Do you know how to use them?

Thanks ;)

High-performance JPEG and JPEG 2000 codecs do exist for the GPU:
CUDA JPEG2000 Codec

As far as fast DCP processing and viewing are concerned, there is MXF Player, which is capable of decoding J2K on the GPU in real time:
MXF Player