Accelerating PNG-Compression using CUDA possible?

Someone experience with this topic yet? We’re developing a web server application in which we accelerate the rendering already with 7900 GTX cards. How ever about 70% of the graphic generation time is currently spent in the png-compression routines using the png/z-lib.

I unfortunately didn’t find the time to start experimenting with CUDA yet and as far as I’ve informed myself till now CUDA promises a very big performance increase for simple functions such as transformations etc…

So my question:
Is it possible to write a CUDA based PNG compression algorithm to reduce our current time of about 5-10ms (dependent on complexity) for a 256x256 big graphic to lets say 1ms?


There may be a small possibility. However, you may need to spend a lot time to make it happen.
If you can afford to use other compression schemes, I guess jpeg or bzip2 may be easier.

Thanks for your fast response. As you already proposed I also believe that there’s at least for JPGs a big potential as the compression is blockwhise in the first step and using lots of float operations as well. Unfortunately JPGs would blur our texts too much, see Map.

And the compressed image has to be IE6 compatible, so a 32-Bit PNG with alpha channel.

I can imagine that there’s a little potential in the line comparison part of the PNG compression, but I think the far more hurting part is the deflate step which offers very less parallelization possibilities I guess.

Someone any idea ?


So PNG uses nothing like a DCT, wavelet transform or other paralellizable full-image transform? It just pushes the image through zlib?

If compressing one image is not sufficiently paralellizable, one idea might be (if applicable in your case) to use the hardware to compress multiple images at once.

A potentially limiting factor here will be how much scratch space zlib requires per compression stream. The default memory footprint (according to is 256 kB, which won’t even fit in the shared memory. The zlib parameters can be tuned to reduce the temporary memory required, or the scratch space could be stored in global memory. Since memory access to the scratch space is unlikely to be in any coalesced pattern, the performance hit for using global memory could be pretty big.

It really does sound like CUDA is a poor fit for PNG compression.

Well, I think any improvement would only be achievable via a completely redesigned parallel algorithm for DEFLATE, not just implementing zlib.
For IE6, maybe you could write a bzip decompresser in JavaScript or something?

but zlib is already incredibly cheap on the CPU… is that really the bottleneck?

Yes, it unfortunately is. To your question “It just pushes the image through zlib?”:

No, it simplifies the data before through comparing the lines with a couple of relative simple algorithms to each other from top to bottom… linewhise. This simplified data is then compressed using deflate. The required time scales from lets say 5 to 15ms dependent on the compression level you set in the zlib… for the same image.

Because of this I think that the most time is spent in the function searching for pixel-repetitions… so looking in the “dicitionary”. The bigger the dictionary is the more time is spent but the better the compression is because of more hits.

So one way I can imagine would be to cut the image into sub images… lets say sixteen 256x16 big images… and this parallel for all 4 channels… so using 64 threads to build 64 indpenedent dictionaries using CUDA to seek these repetitions. And at the end you merge these dictionaries for each channel, build the huffman tree etc…

I guess the time till x-mas will be a bit more quiet… so time to play a bit with the zlib… I’ll let you know when I found a way.