bzip on CUDA: is there enough memory?

I want to port a parallelized version of bzip to CUDA to compare the scalability of CPU vs GPU. After reading the CUDA manual, I doubt that this is feasible.

bzip compresses blocks of 100 kB to 900 kB. Parallelizing the compression of a single block doesn't really work out, but you can compress many blocks in parallel.
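A minimal sketch of that coarse-grained layout: one CUDA thread block per bzip-style input block, with the data staying in global memory. `compress_block` is a hypothetical device routine standing in for the real bzip pipeline (RLE, BWT, MTF, Huffman), not code from this thread:

```cuda
// Coarse-grained parallelism: CUDA block b handles bzip input block b.
// blockOffsets has numBlocks+1 entries delimiting each input block.
__global__ void compressBlocks(const unsigned char *in, unsigned char *out,
                               const int *blockOffsets, int *outSizes,
                               int numBlocks)
{
    int b = blockIdx.x;                 // one input block per CUDA block
    if (b >= numBlocks) return;
    int begin = blockOffsets[b];
    int end   = blockOffsets[b + 1];
    // All threads of this CUDA block cooperate on one input block; the
    // 100-900 kB of data stays in global memory, since it cannot fit in
    // the 16 kB of shared memory.
    // compress_block(in + begin, end - begin, out + begin, &outSizes[b]);
}
```

The design choice here matches the thread: parallelism across blocks rather than within one block's compression.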

The SDK's deviceQuery reports 16 kB of shared memory per block. So I can't even fit one bzip block into one CUDA block?

The solution would be to use global memory instead of shared memory? That means I'd have to do fine-grained parallelization and rewrite the Burrows-Wheeler transform, etc.?
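For reference, the transform in question can be sketched on the host in a few lines. This is the naive O(n² log n) rotation sort, not what bzip actually does (real implementations use suffix-array-style sorts), but it shows the data dependency a fine-grained GPU version would have to parallelize:

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive Burrows-Wheeler transform: sort all rotations of the input and
// emit the last column of the sorted rotation matrix.
std::string bwt(const std::string &s)
{
    size_t n = s.size();
    std::vector<size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);   // rotation start indices 0..n-1
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        for (size_t i = 0; i < n; ++i) {    // compare rotations char by char
            char ca = s[(a + i) % n], cb = s[(b + i) % n];
            if (ca != cb) return ca < cb;
        }
        return false;                       // equal rotations
    });
    std::string last(n, ' ');
    for (size_t i = 0; i < n; ++i)          // last column = char before start
        last[i] = s[(rot[i] + n - 1) % n];
    return last;
}
```

For example, `bwt("banana")` yields `"nnbaaa"` (without the end-of-string sentinel bzip uses). The sorting step is the expensive, hard-to-parallelize part.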

Hi, I'm interested in the same kind of problem, GPU compression. Have you found any suitable solution, or have you implemented a CUDA version of bzip? As a starting point I found an interesting paper about a parallel version of bzip at this link

Yes, there's only 16 kB of shared memory per block, which is shared among all the threads in the block. Another route (if your algorithm will work with it) may be to store some of your data in texture memory, which is faster than global memory for read-only access.

You don’t need to store the whole block in smem to use smem. Also since each thread will be doing its own block, you should actually be thinking of using registers not smem.

I'm not sure what you mean by "having to do fine-grained parallelization", but I'd start with an implementation that uses global memory in a coalesced way. Just using coalesced global memory should still give you a speedup, since a GPU's DRAM is faster than a CPU's L2, and the effective latencies are much less. Then I would see where there is small-scale data reuse, and move that into in-register caches.
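To illustrate what "coalesced" means here, a minimal sketch (not from this thread): consecutive threads in a warp touching consecutive addresses collapse into a few wide memory transactions, while strided access does not.

```cuda
// Coalesced: thread i reads element i, so consecutive threads of a warp
// read consecutive 32-bit words and the loads merge into wide transactions.
__global__ void copyCoalesced(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                      // coalesced
}

// Strided: thread i reads element i*stride, so a warp touches many separate
// memory segments; on the hardware of this era throughput drops sharply.
__global__ void copyStrided(const int *in, int *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];    // uncoalesced for stride > 1
}
```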

Also don’t use texture memory. It’s not faster if you’ve got coalescing working.

You can never use all of the 16 kB, because the kernel call parameters are also stored in shared memory, so there is always a bit less than 16 kB available for your algorithm…


I am interested in trying to do what you are talking about. I would like to know if any of you have already done it, or if you have heard anything about the topic.

Thank you very much.

There is a thread around here somewhere in which the impossible was achieved: full use of the 16 kB of shared memory.

You have to transfer the parameters and some special variables to registers first; then it works (*)

*of course it is an evil hack
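A heavily hedged sketch of the idea being described (I have not verified this trick myself; on the pre-Fermi hardware of this thread, kernel arguments were passed via shared memory, so the hack is to copy them into local variables, which the compiler keeps in registers, before claiming the rest of shared memory):

```cuda
__global__ void kernel(const unsigned char *gIn, int n)
{
    // Copy the kernel arguments into registers first, so the shared-memory
    // region holding them is no longer needed afterwards.
    const unsigned char *in = gIn;
    int count = n;

    // Dynamically sized shared array, sized via the third launch parameter;
    // after the copies above it can (in principle) occupy nearly all 16 kB.
    extern __shared__ unsigned char smem[];

    // ... use only `in`, `count`, and `smem` from here on ...
}
```

As the poster says, this is an evil hack that depends on undocumented behavior of that hardware generation.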


I was asking about running bzip on CUDA. Did anyone try it? I could not find any links to successful ports, so I assume that if there are any, they are being kept from the public (for whatever reason).


Hello guys.

I'm doing my final essay on this topic.

As I haven't started any development yet (only the theory part), I'd really appreciate it if someone has any code that I can start with or compare against.


It's not bzip, but I proposed an answer to a similar question here.

Snippet below:



Thx for the info… I asked permission to join the group.

But anyway… have you taken the time to try to implement this software?

Nope, I'm working on another GPU project right now, but I think this would be a short and fun effort for someone to tackle.