I’m doing the final essay of my CS degree on an algorithm that does compression on the video card (GPU). For that I chose CUDA.
I’m hoping I can do something nice by the end of the semester, although I have NO experience with CUDA and only weak C/C++ skills (I’m a Java developer).
I’m studying ways to do it. I’ve found pigz and I intend to use it as a guide for how I should do it.
If you have an idea of how this algorithm should be implemented, please reply.
I’d do it block-wise, with every multiprocessor crunching on an independent block of input data. Encryption with CUDA would have been way easier, if you ask me. In compression the data dependencies lead to lots of branching, which kills performance (divergent paths within a warp are bad). And CUDA isn’t so good with 32-bit integer operations (the fast hardware integer multiply is only 24 bits wide); it excels more at floating point. But if you consider lossy compression algorithms, CUDA may be better suited: those typically do some signal processing (FFT, DCT, wavelet, MDCT, you name it) before throwing data away.
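To make the block-wise idea a bit more concrete, here is a rough, untested sketch. Everything in it is made up for illustration (the 64 KB chunk size, the compress_chunk stub that just stores bytes verbatim instead of compressing); a real LZ/Huffman stage would go where the stub is, and it would spread the per-chunk work over all threads of the block instead of just thread 0.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

#define CHUNK_SIZE (64 * 1024)   // per-chunk size; an arbitrary choice for this sketch

// Placeholder "compressor": stores the chunk verbatim and reports its length.
// A real LZ/Huffman stage would go here.
__device__ int compress_chunk(const unsigned char* in, int n, unsigned char* out)
{
    for (int i = 0; i < n; ++i) out[i] = in[i];
    return n;                     // "compressed" length (== input length for the stub)
}

__global__ void compress_blockwise(const unsigned char* input, int total_len,
                                   unsigned char* output, int* out_lens)
{
    int chunk  = blockIdx.x;      // each block owns one independent chunk
    int offset = chunk * CHUNK_SIZE;
    if (offset >= total_len) return;

    int len = min(CHUNK_SIZE, total_len - offset);

    // Only thread 0 works in this sketch; a real kernel would parallelize
    // the per-chunk work across the whole block.
    if (threadIdx.x == 0)
        out_lens[chunk] = compress_chunk(input + offset, len,
                                         output + (size_t)chunk * CHUNK_SIZE);
}

int main()
{
    const int total_len = 1 << 20;                        // 1 MB of dummy input
    const int chunks    = (total_len + CHUNK_SIZE - 1) / CHUNK_SIZE;

    unsigned char *d_in, *d_out; int *d_lens;
    cudaMalloc(&d_in,  total_len);
    cudaMalloc(&d_out, (size_t)chunks * CHUNK_SIZE);
    cudaMalloc(&d_lens, chunks * sizeof(int));
    cudaMemset(d_in, 'A', total_len);

    compress_blockwise<<<chunks, 32>>>(d_in, total_len, d_out, d_lens);
    cudaDeviceSynchronize();
    printf("compressed %d chunks\n", chunks);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_lens);
    return 0;
}
```

The nice property is that the chunks have no data dependencies between them, similar in spirit to how pigz splits its input across worker threads, so the kernel scales with the number of multiprocessors.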
I would really like to see some more research into this area. Lossless compression seems to be one of the few areas in which CUDA falls flat on its face. Is anyone aware of any work on compression/decompression algorithms that can run in parallel? (I’m pretty sure the common gzip/bzip2/7z ones are incredibly hard to parallelize.) If nothing like that exists, it might make sense to go back to the drawing board and start designing new algorithms…
The idea is to make a lossless algorithm. I think it might be possible to do the compression in about the same amount of time as on the CPU. I’m not sure if I’ll be able to achieve that, but even if I can’t, having software that doesn’t use the CPU while you’re compressing/decompressing is always nice to have.
cbuchner1, thanks for the tips.
CapJo, I’ll be reading the links you sent me. Thanks!
Here is another link on the same topic. IBM already had the idea of doing compression in parallel, and they are also using a Lempel-Ziv algorithm (like in the other links I posted).
Patent issues aside, arithmetic coding based on floats could be a worthwhile topic for investigation. Traditionally, CPU-based implementations use fixed-point integer arithmetic (usually on 16-bit ints), but on the GPU floats might perform better. The main challenge would be to emit the coded bits in a way that does not cause too many divergent branches among threads (and to use shared memory heavily, storing the bits until you have gathered enough to write them out to global memory in a coalesced way).
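Something along these lines (untested, and every name and size in it is invented for the sketch) is what I have in mind for the bit emission: each thread packs its bits into a register, parks full 32-bit words in a shared-memory staging buffer with an interleaved layout, and the block flushes that buffer to global memory with coalesced stores.

```cpp
#define THREADS_PER_BLOCK 128
#define WORDS_PER_THREAD  8              // staging capacity per thread (32 bits each)

struct BitWriter {
    unsigned int word;                   // bits currently being accumulated
    int nbits;                           // how many bits are in 'word'
    int nwords;                          // full words already parked in shared memory
};

// Append one coded bit; when 32 bits have accumulated, park the word in the
// staging buffer. The interleaved layout (word w of thread t at w*T + t)
// makes the later flush coalesce.
__device__ void emit_bit(BitWriter& bw, unsigned int* stage, int bit)
{
    bw.word = (bw.word << 1) | (bit & 1);
    if (++bw.nbits == 32) {
        stage[bw.nwords * THREADS_PER_BLOCK + threadIdx.x] = bw.word;
        bw.nbits = 0;
        ++bw.nwords;                     // a real coder flushes when this reaches WORDS_PER_THREAD
    }
}

__global__ void demo_bit_staging(unsigned int* global_out)
{
    __shared__ unsigned int stage[THREADS_PER_BLOCK * WORDS_PER_THREAD];

    // Clear this thread's staging slots so the flush never writes garbage.
    for (int w = 0; w < WORDS_PER_THREAD; ++w)
        stage[w * THREADS_PER_BLOCK + threadIdx.x] = 0;

    BitWriter bw = {0, 0, 0};
    // Pretend the arithmetic coder produced 64 bits for this thread.
    for (int i = 0; i < 64; ++i)
        emit_bit(bw, stage, (threadIdx.x >> (i & 7)) & 1);

    __syncthreads();                     // all words parked before the block-wide flush

    // Consecutive threads write consecutive addresses -> coalesced stores.
    for (int w = 0; w < WORDS_PER_THREAD; ++w)
        global_out[blockIdx.x * THREADS_PER_BLOCK * WORDS_PER_THREAD
                   + w * THREADS_PER_BLOCK + threadIdx.x] =
            stage[w * THREADS_PER_BLOCK + threadIdx.x];
}
```

The divergence cost is then confined to the emit_bit branch, while the expensive part (the global memory traffic) stays coalesced.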
There are ways to force particular rounding modes in CUDA; the compiler provides intrinsics for it (e.g. __fadd_rd and __fmul_rd round toward negative infinity), not just raw PTX. By choosing a suitable mode the behavior is fully defined.
Renormalization of the interval at known times can bring the numbers back into suitable ranges (we definitely need to stay away from denormals).
When compression and decompression both happen on CUDA devices, no problems should occur. If the CPU will do the decompression, then we need to make sure the rounding on the CPU behaves exactly the same as on the CUDA device.
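To illustrate the rounding-mode point, the interval update of a binary coder could be pinned to round-toward-negative-infinity with the __fmul_rd / __fadd_rd / __fsub_rd intrinsics. The coder state and the probability model below are placeholders, and the bit output plus carry/underflow handling are left out, so treat it only as a sketch of the idea.

```cpp
// Placeholder coder state; the probability model and bit output are omitted.
struct Interval {
    float low;      // lower end of the current coding interval
    float range;    // width of the current coding interval
};

// Encode one binary symbol, where p0 is the model's probability of a 0.
// Every floating-point step rounds toward -infinity, so a decoder using the
// same intrinsics (or a CPU pinned to the same rounding mode) reproduces the
// identical sequence of intervals.
__device__ void encode_bit_rd(Interval& iv, float p0, int symbol)
{
    float split = __fmul_rd(iv.range, p0);   // deterministic split point

    if (symbol == 0) {
        iv.range = split;
    } else {
        iv.low   = __fadd_rd(iv.low, split);
        iv.range = __fsub_rd(iv.range, split);
    }

    // Renormalize while the interval is small, keeping 'range' well away from
    // denormals. A real coder would emit the leading bit of 'low' here and
    // handle carries/underflow; that part is omitted from this sketch.
    while (iv.range < 1.0f / 65536.0f) {
        if (iv.low >= 0.5f) iv.low -= 0.5f;  // drop the bit that would be emitted
        iv.low   *= 2.0f;                    // doubling is exact in binary floats
        iv.range *= 2.0f;
    }
}
```

On the CPU side, fesetround(FE_DOWNWARD) from <fenv.h> would be the matching setting, provided the compiler is not allowed to reorder or contract the float operations.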
It’s been a long time since I last played with arithmetic coding myself.