Parallel huffman decoding in JPEG brute forcing it, one thread per input bit ;-)

cbuchner1 · June 27, 2009, 8:49pm

Hi,

I was wondering if it might be possible to offload even the huffman step of JPEG decoding onto the GPU, so I’ve had a few ideas.

Previous parallelization attempts have tried to decode the JPEG using several (e.g. 4) CPU cores, aiming for resynchronization during decoding. In contrast to this, I intend to brute-force it by assigning one CUDA thread to each bit offset in the JPEG file for the huffman decoding.

Let’s assume we have a sequentially coded, monochrome JPEG (no U,V color channels) without any restart markers between the MCUs (minimal coded units, i.e. 8 x 8 DCT blocks).

My idea breaks down as follows. First we guess the location where any MCUs might start. Then we brute-force decode all possible codewords within the next 512 bits. Then we gather only the valid codewords in each MCU. And finally we sequentially throw out incorrectly decoded or overlapping MCUs.

Here are the steps in more detail:

After the “Start of Scan” marker (skipping the JFIF header, quantization tables and such), search the encoded data for End of Block bit sequences, i.e. the sequence “1010”. We assume each EOB marks the beginning of a new 8x8 pixel MCU. At most one EOB per byte of input data shall be detected, so each CUDA thread can scan one different byte offset in the file. A lookup table determines if the following eleven bits contain a candidate (why eleven? Assuming the sequence starts at the seventh bit we need to look at three more bits from the following byte). We record the respective MCU offsets in the file in some list. The rate of false positives will definitely be high (given the shortness of the EOB huffman sequence), but let us ignore this problem for now.
Launch thread blocks, where each block starts looking at one MCU candidate in the file. The first thread decodes the DC codeword using the respective huffman tree. All other threads decode AC codewords according the AC specific huffman trees. Each thread gets to start at a different bit offset in the file (e.g. offset by threadIdx.x bits relative to the MCU candidate position). Decoding is done entirely through table lookup (no bit shifting magic required). We’re going to decode a lot of nonsensical codewords in most of the threads. Each thread stores
-the number of bits in the codeword
-the number of zeros in the run length coding in case of AC.
-the quantized coefficient value (DC or AC)
NOTE: the 512 thread limit within a CUDA thread block limits this approach to decode 512 bits per MCU, unless we are going to decode more than one codeword per thread.
Launch thread blocks, where each warp scans one candidate MCU sequentially (we use one active warp per thread to avoid warp-divergence). Gather the codewords: First DC, then a sequence of AC until EOB. We skip all previously incorrectly decoded codewords because we now know the length of each valid codeword. Mark a MCU candidate as invalid and terminate the thread when we find more than 64 coefficients per block. Record the position of the EOB markers terminating each valid block.
In a sequential scan we need to drop/ignore these MCUs that did not begin at a position following the a valid EOB marker of a previous block (overlapping decodes). We collect all valid MCUs and gather the respective DC+AC coefficients.
Now we can perform zig-zag reordering, dequantization and IDCT (all well-known and previously implemented in CUDA by others).

So overall I think this approach will be very much bandwidth limited, we’re going to perform a lot of redundant decoding but there will be no branch divergence within warps. Hopefully the tables for huffman decoding will fit into shared memory. The required sequential operations (scans) will be limiting our throughput because the memory access will be quite random. But these sequential scans of 3) can happen in parallel in many warps utilizing all SMs.

Long huffman codewords may be a problem for table based decoding, we will have to see how to solve this.

Adding color support to this approach would immensely complicate things.

I think I may take a shot at this. I want a CUDA’ized JPEG decoder that leaves no work for the CPU. For grayscale this will work - but is there any chance that such brute-forcing can beat or at least match the CPU?

Comments welcome, even if it’s a “dude, no chance”.

Christian

cvnguyen · June 28, 2009, 4:42am

This is a try-and-false approach. Assuming that your algorithm could be well parallelized, the speed-up gained would be small in the sense that false/try ratio is so high.

cvnguyen · June 28, 2009, 5:26am

One more thing: your algorithm may be dependent on the executing order of blocks in a grid. As far as I know, unlike the executing order of threads in a block, this order is not clarified in CUDA specification.

Topic		Replies	Views
JPEG 2000 decoding with CUDA CUDA Programming and Performance	7	5731	February 15, 2024
JPEG encoding with CUDA CUDA Programming and Performance	7	10516	January 31, 2011
Opening an image file direct from GPU CUDA Programming and Performance	1	1493	October 22, 2008
Feasibility of High-Performance Online Huffman Decoding + GEMM Fusion on GPU CUDA Programming and Performance	12	297	September 1, 2025
lossless JPEG fast decompression on CUDA CUDA Programming and Performance	7	17099	May 3, 2012
MP3 DEcoder? CUDA Programming and Performance	12	11317	May 30, 2009
Data Compression on GeForce CUDA Programming and Performance	9	9299	May 5, 2012
Video decoding of streaming JPEG 2000 ( MP2 ) ? Is CUDA appropriate processing for streaming video d CUDA Programming and Performance	4	5362	November 9, 2008
Jpeg decoding result using NPP library GPU-Accelerated Libraries	6	1267	January 22, 2025
Any performance data on using parallelizable parts of jpeg compression CUDA Programming and Performance	0	727	September 24, 2010

Parallel huffman decoding in JPEG brute forcing it, one thread per input bit ;-)

Related topics