I know that compression is generally not a good candidate for parallel programming.
But in my case, I need to compress chunks of 32 KB data blocks. Each data block
is unrelated to the others and should be compressed separately.
There are performance limits to doing this on a regular CPU. Is the GPU worth evaluating?
Secondly, which compression algorithms would be good in this case? LZ77, LZW, etc.,
or maybe even a Burrows-Wheeler transform followed by entropy encoding?
If compression is a good candidate, is it a good idea to split the work among the 32 threads of a warp?
By this I mean: 30 threads fetch data into shared memory, and 2 threads do the actual compression.
(At least techniques like this could help hide memory latency?)
You might want to start with a simple algorithm, for example static Huffman coding (for a known distribution of input symbol probabilities). The difficult part is generating a single bit stream from many parallel threads.
A first CUDA kernel could determine all the code words in a parallel fashion (this can be done with a simple table lookup), and a second kernel could generate a compact bit stream from them. Ideally each multiprocessor would be assigned one 32 KB block, so the overall speed scales with the number of multiprocessors you have.
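To make the two-kernel idea concrete, here is a minimal CPU sketch of both stages. The code table and codewords are made up for illustration; on the GPU, the first loop would be one thread per symbol, and the second (serial) loop is exactly the compaction step that is hard to parallelize (a prefix sum over the code lengths would give each thread its output bit offset).

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical static code table entry: codeword bits plus bit length.
// (Code lengths up to 24 bits keep the 32-bit accumulator below safe.)
struct Codeword { uint32_t bits; uint8_t len; };

// Kernel-1 analogue: per-symbol table lookup, trivially parallel.
std::vector<Codeword> lookup_codewords(const std::vector<uint8_t>& data,
                                       const Codeword table[256]) {
    std::vector<Codeword> out(data.size());
    for (size_t i = 0; i < data.size(); ++i)  // one thread per symbol on a GPU
        out[i] = table[data[i]];
    return out;
}

// Kernel-2 analogue: serial compaction into a single bit stream.
std::vector<uint8_t> pack_bits(const std::vector<Codeword>& codes) {
    std::vector<uint8_t> out;
    uint32_t acc = 0;   // bit accumulator
    int nbits = 0;      // bits currently buffered
    for (const Codeword& c : codes) {
        acc = (acc << c.len) | (c.bits & ((1u << c.len) - 1));
        nbits += c.len;
        while (nbits >= 8) {           // emit whole bytes as they fill up
            nbits -= 8;
            out.push_back(uint8_t(acc >> nbits));
        }
    }
    if (nbits > 0)                     // zero-pad the final partial byte
        out.push_back(uint8_t(acc << (8 - nbits)));
    return out;
}
```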
Investigating arithmetic coding on the GPU could be a very interesting topic as well. Given that NVIDIA GPUs are faster at floating-point arithmetic than at 32-bit integer arithmetic, the probability intervals could be expressed as floats instead of ints.
Again, the problem remains that it will be difficult to generate a bit stream from the input data. Maybe each thread could generate its own bit stream, with all of them concatenated at the end. But you would also need to write an index for the decoder, describing the offset at which each thread's bit stream is located in the compressed file. Otherwise there would be no way to decode it in a parallel manner.
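A sketch of what such an index could look like, assuming a made-up layout of `[count][offset table][streams]` with little-endian 32-bit fields. The point is only that, given the offset table, a decoder can locate any thread's stream without touching the others:

```cpp
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical file layout:
//   [uint32 count][uint32 offset_0 .. offset_{n-1}][stream_0 .. stream_{n-1}]
std::vector<uint8_t> concat_with_index(
        const std::vector<std::vector<uint8_t>>& streams) {
    std::vector<uint8_t> out;
    auto put32 = [&](uint32_t v) {      // little-endian 32-bit write
        for (int i = 0; i < 4; ++i) out.push_back(uint8_t(v >> (8 * i)));
    };
    put32(uint32_t(streams.size()));
    uint32_t off = uint32_t(4 + 4 * streams.size());  // data follows the header
    for (const auto& s : streams) { put32(off); off += uint32_t(s.size()); }
    for (const auto& s : streams) out.insert(out.end(), s.begin(), s.end());
    return out;
}

// Decoder side: byte range of stream i, found without decoding anything else.
std::pair<uint32_t, uint32_t> stream_range(const std::vector<uint8_t>& file,
                                           uint32_t i) {
    auto get32 = [&](size_t p) {        // little-endian 32-bit read
        uint32_t v = 0;
        for (int k = 0; k < 4; ++k) v |= uint32_t(file[p + k]) << (8 * k);
        return v;
    };
    uint32_t n = get32(0);
    uint32_t begin = get32(4 + 4 * size_t(i));
    uint32_t end = (i + 1 < n) ? get32(4 + 4 * size_t(i + 1))
                               : uint32_t(file.size());
    return {begin, end};
}
```

Each decoder thread would read its `(begin, end)` pair and decode its slice independently.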
This seems to be a general problem with bit streams written by a serial encoder. Without an index describing the locations of independently decodable blocks, a parallel decoder is almost impossible to implement efficiently. For example, decoding a JPEG image on a GPU while exploiting parallelism is hard: there is almost no way to tell where the JPEG's minimum coded units (the 8x8 DCT blocks) are located in the bit stream without having decoded all previous MCUs. One could try to guess, but that wastes computation cycles whenever a guess turns out wrong.
So a parallel algorithm designed for a GPU would have to include some hints indicating where the independently decodable units start (sync markers would work, or an index at the beginning of the file).
If you use Huffman compression, just put the Huffman table in constant memory, then send 256 blocks per multiprocessor, interleaved 256x to enable coalesced reads, and you are done. 8 MB of data per multiprocessor adds up to 240 MB on a GTX 285 (30 multiprocessors), so any implementation should handle that, and with coalesced reads you will use 100% of your GPU cycles.
Just think about concatenating your output in 128-bit blocks within registers, so you don't write it out too often.
You may use the macro-threading idea and have a separate warp (or two) write the compressed data back to global memory :-)
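The "accumulate in registers, write out rarely" idea can be sketched like this (a 64-bit accumulator instead of the suggested 128-bit blocks, to keep the sketch simple; the `words` vector stands in for global memory):

```cpp
#include <cstdint>
#include <vector>

// Bit writer that buffers codewords in a register-sized accumulator and
// only writes out whole 64-bit words, mimicking infrequent global-memory
// stores. Codewords are appended LSB-first; lengths up to 32 bits.
struct BitWriter {
    std::vector<uint64_t> words;  // stand-in for global memory
    uint64_t acc = 0;             // register accumulator
    int nbits = 0;                // bits currently buffered

    void put(uint32_t bits, int len) {
        acc |= uint64_t(bits) << nbits;
        if (nbits + len >= 64) {          // flush only when a word is full
            words.push_back(acc);
            int used = 64 - nbits;        // bits of this codeword that fit
            acc = (used < len) ? (uint64_t(bits) >> used) : 0;
            nbits = nbits + len - 64;     // carry the overflow bits
        } else {
            nbits += len;
        }
    }
    void flush() {                        // write out any trailing partial word
        if (nbits > 0) { words.push_back(acc); acc = 0; nbits = 0; }
    }
};
```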
Ah, that’s a good idea, letting each thread generate its very own stream instead of trying to concatenate the thread outputs with some sort of “stream compaction”, as I had intended.
I disagree on the use of constant memory though: it exhibits suboptimal performance when different threads access different locations. Maybe texture memory would be better for the code tables? The most probable codewords should be spatially close together in the texture. Using a 2D texture would allow second-order Huffman coding, for example for text compression, where certain pairs of letters occur more frequently than others.
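The 2D-table idea boils down to making the code for a symbol depend on the previous symbol, so the lookup becomes `table[prev][cur]`, which is exactly the access pattern a 2D texture fetch would serve. A minimal CPU sketch, with a made-up table and an assumed initial context of 0:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Codeword entry: bits plus bit length (codes here are for illustration only).
struct Code { uint32_t bits; uint8_t len; };

// Order-1 (pair-context) encoding: the codeword for each symbol is looked up
// under its predecessor, i.e. a 2D index (prev, cur) into a 256x256 table.
std::vector<Code> encode_order1(const std::vector<uint8_t>& data,
                                const Code* table /* 256*256 entries */) {
    std::vector<Code> out(data.size());
    uint8_t prev = 0;                        // assumed initial context
    for (size_t i = 0; i < data.size(); ++i) {
        out[i] = table[size_t(prev) * 256 + data[i]];  // 2D lookup
        prev = data[i];
    }
    return out;
}
```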
I think that the right approach to compression on parallel architectures would involve a from-scratch redesign of the algorithms.
The core of lossless data compression is identifying redundancy in the data. This is done using a combination of clever schemes for run-length encoding and Huffman encoding.
For example, for the LZ family: “Many algorithms forward-scan the input buffer and match it against a dictionary they maintain. They scan into the buffer until they cannot find a match in the dictionary. At that point they output the location of the word in the dictionary, if one is available, the match length, and the character that caused the match failure. The resulting word is then added to the dictionary.”
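A minimal sequential version of that scheme, in the LZ78 style (one member of the LZ family; the output pairs and the 1-based indexing are illustrative choices, not a standard format):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Scan forward, extending the match while the phrase is in the dictionary;
// on a match failure, emit (dictionary index of the longest match, failing
// character) and add the new phrase. Index 0 means "no previous match".
std::vector<std::pair<uint32_t, char>> lz78_compress(const std::string& in) {
    std::map<std::string, uint32_t> dict;   // phrase -> index (1-based)
    std::vector<std::pair<uint32_t, char>> out;
    std::string phrase;
    for (char ch : in) {
        std::string candidate = phrase + ch;
        if (dict.count(candidate)) {
            phrase = candidate;             // still matching: keep scanning
        } else {
            uint32_t idx = phrase.empty() ? 0 : dict[phrase];
            out.emplace_back(idx, ch);      // longest match + failing char
            dict[candidate] = uint32_t(dict.size() + 1);
            phrase.clear();
        }
    }
    if (!phrase.empty())                    // trailing match, no failing char
        out.emplace_back(dict[phrase], '\0');
    return out;
}
```

The serial dependence is visible here: every step consults a dictionary built by all previous steps, which is exactly what a parallel redesign has to break.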
For a parallel algorithm, the basic principles would remain the same. You would still want to identify and exploit redundancy in the data. However, you would probably want to make the identification of redundant data decomposable and hierarchical. Maybe start by building up intermediate results (word/pattern/phrase frequencies) in parallel, then barrier, share the generated dictionaries, and identify redundancy in the dictionaries rather than in the input data. Repeat multiple stages of this until you get diminishing returns.
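The first stage of that staged approach can be sketched with plain symbol frequencies (the simplest kind of intermediate result; real dictionaries of words or phrases would merge the same way). Each block is counted independently, then the partial tables are combined after the barrier:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Stage 1: each block builds its own frequency table in parallel.
// Stage 2 (after the barrier, here join()): merge the partial tables
// into one shared dictionary.
std::map<char, size_t> staged_frequencies(
        const std::vector<std::string>& blocks) {
    std::vector<std::map<char, size_t>> partial(blocks.size());
    std::vector<std::thread> workers;
    for (size_t b = 0; b < blocks.size(); ++b)
        workers.emplace_back([&, b] {       // independent per-block work
            for (char c : blocks[b]) ++partial[b][c];
        });
    for (auto& t : workers) t.join();       // barrier

    std::map<char, size_t> merged;          // combine the dictionaries
    for (const auto& p : partial)
        for (const auto& kv : p) merged[kv.first] += kv.second;
    return merged;
}
```

On a GPU the merge itself would be a parallel reduction rather than this serial loop, but the decomposition is the same.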
Blocking schemes will not be sufficient here; redundant information will need to be identified and shared among blocks. We really need two subproblems to be solved:
Building a dictionary in parallel.
Performing entropy encoding (alphabet generation) in parallel.
The quality of the dictionaries and alphabets generated needs to be on par with existing sequential algorithms, and the algorithms need to scale to millions of threads. This will be the most significant result, and clever schemes (e.g. never materializing the dictionary) will follow.
There is obviously a lot of work that needs to be done. We need an expert on this topic, or someone willing to put in enough time to become one. NVIDIA, Intel, AMD, etc. should step up and fund a multi-year thesis on this topic.