Hi, I'm interested in the same kind of problem, GPU compression. Have you found any suitable solution, or have you implemented a CUDA version of bzip? As a starting point I found an interesting paper on a parallel version of bzip2 at this link: http://gilchrist.ca/jeff/papers/Parallel_BZIP2.pdf
Yes, there's only 16 KB of shared memory per block, and all the threads in the block have to share it. Another route (if your algorithm will work with it) may be to store some of your data in texture memory, which is faster than global memory for read-only access.
You don't need to store the whole block in smem in order to use smem. Also, since each thread will be compressing its own independent block, you should really be thinking in terms of registers, not smem.
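To illustrate the block-per-thread idea, here's a minimal sketch (my own toy example, not from the paper) of a run-length-style pass where each thread compresses its own block. All the working state is scalar locals, so the compiler keeps it in registers; no shared memory is involved because the threads never cooperate:

```cuda
// Sketch only: one independent input block per thread, RLE as a
// stand-in for a real compressor stage. Worst-case output is
// 2 * block_len bytes per block (run byte + value byte per input byte).
__global__ void compress_blocks(const unsigned char *in, unsigned char *out,
                                int *out_len, int block_len)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned char *src = in  + (size_t)tid * block_len;
    unsigned char       *dst = out + (size_t)tid * 2 * block_len;

    // n, prev, and run are scalars, so they live in registers.
    int n = 0;
    unsigned char prev = src[0];
    unsigned char run  = 1;
    for (int i = 1; i < block_len; ++i) {
        unsigned char c = src[i];
        if (c == prev && run < 255) {
            ++run;
        } else {
            dst[n++] = run; dst[n++] = prev;
            prev = c; run = 1;
        }
    }
    dst[n++] = run; dst[n++] = prev;
    out_len[tid] = n;
}
```

Note that as written each thread strides through its own contiguous block, so the global reads are not coalesced; the layout matters as much as the per-thread logic.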
I'm not sure what you mean by "having to do fine-grained parallelization", but I'd start with an implementation that accesses global memory in a coalesced way. Coalesced global memory alone should still give you a speedup, since the GPU's DRAM bandwidth is far higher than a CPU's, and with enough threads in flight the latency is effectively hidden. Then I would look for small-scale data reuse and move that into in-register caches.
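One way to get coalescing with a block-per-thread design (a sketch under my own assumptions, not anything from the paper) is to transpose the input so that byte i of thread t's block sits at `in[i * num_threads + t]`. At each loop iteration the 32 threads of a warp then read 32 consecutive bytes, which the hardware can service with a few wide transactions:

```cuda
// Sketch: interleaved ("transposed") input layout for coalesced reads.
// The per-thread counts[] array is tiny and indexed with small constants,
// so it stays in registers -- an example of an in-register cache.
__global__ void scan_interleaved(const unsigned char *in,
                                 unsigned int *hist,      // hypothetical output
                                 int block_len, int num_threads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int counts[4] = {0, 0, 0, 0};

    for (int i = 0; i < block_len; ++i) {
        unsigned char c = in[(size_t)i * num_threads + t];  // coalesced read
        ++counts[c >> 6];  // toy statistic: histogram of the top 2 bits
    }
    for (int k = 0; k < 4; ++k)
        hist[t * 4 + k] = counts[k];
}
```

The transpose itself costs one extra pass over the data, but that pass can also be coalesced, so it's usually cheap compared to running the main loop uncoalesced.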
Also, don't use texture memory: it's no faster than global memory once you've got coalescing working.