I was wondering if it might be possible to offload even the Huffman step of JPEG decoding onto the GPU, so I’ve had a few ideas.
Previous parallelization attempts have tried to decode the JPEG on several (e.g. 4) CPU cores, relying on resynchronization during decoding. In contrast, I intend to brute-force it by assigning one CUDA thread to each bit offset in the JPEG file for the Huffman decoding.
Let’s assume we have a sequentially coded, monochrome JPEG (no U,V color channels) without any restart markers between the MCUs (minimum coded units, i.e. 8 x 8 DCT blocks).
My idea breaks down as follows. First we guess the location where any MCUs might start. Then we brute-force decode all possible codewords within the next 512 bits. Then we gather only the valid codewords in each MCU. And finally we sequentially throw out incorrectly decoded or overlapping MCUs.
Here are the steps in more detail:
After the “Start of Scan” marker (skipping the JFIF header, quantization tables and such), search the encoded data for End of Block (EOB) bit sequences, i.e. the sequence “1010”. We assume each EOB marks the beginning of a new 8x8 pixel MCU. At most one EOB per byte of input data shall be detected, so each CUDA thread can scan a different byte offset in the file. A lookup table determines whether the eleven bits starting at that byte contain a candidate (why eleven? If the sequence starts at the seventh bit of the byte, we need three more bits from the following byte). We record the respective MCU offsets in the file in some list. The rate of false positives will certainly be high, given how short the EOB Huffman sequence is, but let us ignore this problem for now.
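To make the candidate test concrete, here is a rough CPU sketch of what each thread would compute (the function name and the bit-mask encoding of the result are my own placeholders, not part of any existing decoder):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the per-thread EOB scan: check whether any of the 8 bit
   offsets inside the byte at `pos` starts the EOB pattern "1010".
   We read an 11-bit window (8 bits from data[pos], 3 from
   data[pos+1]) so that even offset 7 sees a full 4-bit group.
   Returns a bitmask with bit `off` set if offset `off` matches. */
static int eob_candidate_mask(const uint8_t *data, size_t pos)
{
    /* 11-bit window, MSB-first as in the JPEG bitstream */
    unsigned window = ((unsigned)data[pos] << 3) | (data[pos + 1] >> 5);
    int mask = 0;
    for (int off = 0; off < 8; ++off) {
        unsigned group = (window >> (7 - off)) & 0xF;
        if (group == 0xA)           /* binary 1010 */
            mask |= 1 << off;
    }
    return mask;
}
```

On the GPU this loop unrolls into one lookup-table probe per thread, but the bit arithmetic is the same.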
Launch thread blocks, where each block starts at one MCU candidate in the file. The first thread decodes the DC codeword using the respective Huffman tree. All other threads decode AC codewords according to the AC-specific Huffman trees. Each thread starts at a different bit offset in the file (e.g. offset by threadIdx.x bits relative to the MCU candidate position). Decoding is done entirely through table lookup (no bit-shifting magic required). Most threads will decode nonsensical codewords. Each thread stores:
- the number of bits in the codeword
- the number of zeros in the run-length coding (AC only)
- the quantized coefficient value (DC or AC)
NOTE: the 512-thread limit per CUDA thread block restricts this approach to 512 bits per MCU, unless each thread decodes more than one codeword.
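A CPU sketch of the per-offset table decode might look like this (the 8-bit lookup width and the record layout are placeholders; real JPEG AC codes run up to 16 bits, which is exactly the long-codeword problem mentioned below):

```c
#include <stdint.h>
#include <stddef.h>

#define LOOKUP_BITS 8   /* toy value; real AC codes need more */

/* What each thread would store for its bit offset. */
typedef struct {
    uint8_t len;   /* total codeword length in bits (0 = invalid) */
    uint8_t run;   /* zero-run length from the RRRRSSSS symbol    */
    int16_t value; /* quantized coefficient (DC or AC)            */
} Codeword;

/* Extract LOOKUP_BITS bits starting at absolute bit offset `bit`
   (MSB-first, assumes 2 readable bytes past the current one). */
static unsigned peek_bits(const uint8_t *data, size_t bit)
{
    unsigned v = ((unsigned)data[bit / 8] << 16)
               | ((unsigned)data[bit / 8 + 1] << 8)
               |  (unsigned)data[bit / 8 + 2];
    return (v >> (24 - LOOKUP_BITS - bit % 8)) & ((1u << LOOKUP_BITS) - 1);
}

/* One thread's entire job: a single table probe at its bit offset. */
static Codeword decode_at(const Codeword table[1 << LOOKUP_BITS],
                          const uint8_t *data, size_t bit)
{
    return table[peek_bits(data, bit)];
}
```

The point is that there is no data-dependent branching here at all, so the warp stays converged no matter how nonsensical the decoded codewords are.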
Launch thread blocks, where each warp scans one candidate MCU sequentially (one active thread per warp, to avoid warp divergence). Gather the codewords: first DC, then a sequence of AC codewords until EOB. We can skip over all the incorrectly decoded codewords because we now know the length of each valid codeword. Mark an MCU candidate as invalid and terminate the thread when we find more than 64 coefficients per block. Record the position of the EOB marker terminating each valid block.
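The gather logic, sketched on the CPU (array names are hypothetical; `len`, `run` and `is_eob` stand for the per-bit-offset results of the previous step, indexed relative to the MCU candidate):

```c
#include <stdint.h>
#include <stddef.h>

/* Walk one MCU candidate codeword by codeword. Returns the bit
   offset just past the terminating EOB (relative to the candidate
   start), or -1 if the candidate is invalid. */
static long gather_mcu(const uint8_t *len, const uint8_t *run,
                       const uint8_t *is_eob, size_t window_bits)
{
    size_t bit = 0;
    int coeffs = 1;                  /* the DC coefficient comes first */
    bit += len[0];                   /* skip past the DC codeword      */
    while (bit < window_bits) {
        if (len[bit] == 0)
            return -1;               /* no valid codeword decoded here */
        if (is_eob[bit])
            return (long)(bit + len[bit]);   /* valid MCU: past EOB    */
        coeffs += run[bit] + 1;      /* zero run plus one AC value     */
        if (coeffs > 64)
            return -1;               /* too many coefficients: invalid */
        bit += len[bit];
    }
    return -1;                       /* ran past the window: invalid   */
}
```

Note the loop hops from valid codeword to valid codeword using the recorded lengths, which is why all the garbage decoded in between never has to be looked at again.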
In a sequential scan we drop/ignore those MCUs that did not begin at the position immediately following a valid EOB marker of a previous block (overlapping decodes). We collect all valid MCUs and gather the respective DC and AC coefficients.
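The sequential chain check is cheap; sketched on the CPU it is just this (names hypothetical; `start` and `end` are the per-candidate bit offsets recorded above, sorted by start position):

```c
#include <stddef.h>

/* Keep only the chain of MCUs in which each block starts exactly
   where the previous kept block ended, beginning at `origin` (the
   first bit after the Start of Scan data begins). Sets keep[i] and
   returns the number of MCUs kept. */
static size_t chain_mcus(const long *start, const long *end,
                         size_t n, long origin, int *keep)
{
    long expect = origin;
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        if (start[i] == expect) {   /* continues the valid chain   */
            keep[i] = 1;
            expect = end[i];
            ++kept;
        } else {
            keep[i] = 0;            /* overlapping or stray decode */
        }
    }
    return kept;
}
```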
Now we can perform zig-zag reordering, dequantization and IDCT (all well-known and previously implemented in CUDA by others).
So overall I think this approach will be very much bandwidth limited: we’re going to perform a lot of redundant decoding, but there will be no branch divergence within warps. Hopefully the Huffman decoding tables will fit into shared memory. The required sequential operations (scans) will limit our throughput because the memory accesses will be quite random. But the sequential scans of the gather step can happen in parallel in many warps, utilizing all SMs.
Long Huffman codewords may be a problem for table-based decoding; we will have to see how to solve this.
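One common remedy (my suggestion, not something settled in this plan) is a two-level table: an 8-bit first-level probe resolves short codes directly, and rare long-code prefixes escape to a small second-level table. The entries below are toy values, not a real JPEG Huffman table:

```c
#include <stdint.h>

typedef struct {
    uint8_t len;     /* total codeword length in bits (0 = escape) */
    uint8_t symbol;  /* decoded RRRRSSSS symbol                    */
} Entry;

/* Decode from the next 16 stream bits: first probe the 256-entry
   level-1 table with the top 8 bits; a zero length means the prefix
   escapes, and the remaining 8 bits index the level-2 table. */
static Entry lookup(const Entry l1[256], const Entry l2[256],
                    unsigned bits16)
{
    Entry e = l1[bits16 >> 8];
    return e.len ? e : l2[bits16 & 0xFF];   /* second probe on escape */
}
```

This keeps the first-level table small enough for shared memory at the cost of an occasional second probe, so the no-divergence property is only mildly weakened.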
Adding color support to this approach would immensely complicate things.
I think I may take a shot at this. I want a CUDA’ized JPEG decoder that leaves no work for the CPU. For grayscale this will work - but is there any chance that such brute-forcing can beat or at least match the CPU?
Comments welcome, even if it’s a “dude, no chance”.