has anyone made any of those work with gpu acceleration?
wouldnt that be sweet?
Compression is pretty much a serial algorithm with dependencies on previous decisions; I don’t think there is much potential for parallelizing a compression algorithm, except maybe for gathering some statistics about the source data.
Algorithms that perform blockwise independent compression might benefit a bit by distributing these blocks across the various SMs on the GPU. But the question still remains how to fully occupy the many threads (up to 768 or 1024) within an SM without causing branch divergence.
Don’t some of the compression algorithms use Huffman coding? I don’t know how much CPU time is spent on that part of the compression, but that should be easily parallelized (AFAIK).
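To make the “easily parallelized” part concrete: building the Huffman table from symbol frequencies is a cheap, serial step; once the table is fixed, every symbol maps to its codeword independently. Here’s a minimal CPU-side Python sketch (function names are mine, just to illustrate, not CUDA code):

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a Huffman code table {symbol: bit string} from frequencies.
    This serial step is cheap; applying the finished table to every
    symbol is the embarrassingly parallel part."""
    freq = Counter(data)
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, tiebreak id, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(data, table):
    # once the table is fixed, each symbol encodes independently
    return "".join(table[b] for b in data)
```

The per-symbol table lookups are what you’d farm out to GPU threads; the hard part (as the next post points out) is where each codeword lands in the output bit stream.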
Huffman code is a variable-length code, so the position of each codeword in the final bit stream depends on previous coding decisions. So I can’t see how this would work on the GPU.
it can work by having each thread output its codeword at the maximum possible separation and afterwards removing the empty space with a scan (prefix sum) algorithm. I believe there is even a compact function in CUDPP.
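A sketch of that trick in Python (hypothetical function name, not the CUDPP API): an exclusive prefix sum over the codeword lengths gives every codeword its output bit offset, after which all writes are independent and could be done in parallel:

```python
def pack_codewords(codewords):
    """Pack variable-length bit strings contiguously.
    An exclusive prefix sum over the lengths yields each codeword's
    output offset -- the same idea a GPU scan/compact uses to squeeze
    out gaps after parallel encoding."""
    lengths = [len(c) for c in codewords]
    # exclusive prefix sum: starting bit offset of each codeword
    offsets, total = [], 0
    for n in lengths:
        offsets.append(total)
        total += n
    out = ["0"] * total
    # with offsets known, every write is independent (parallelizable)
    for off, code in zip(offsets, codewords):
        for i, bit in enumerate(code):
            out[off + i] = bit
    return "".join(out)
```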
up to today, compression algorithms have been designed to work on a single CISC CPU, but this doesn’t mean that we can’t have better, faster algorithms on GPGPUs.
For simplicity, think of a file of n^2 bits as an A=n*n square matrix (padded with zeros at the end if necessary).
A stupid way to do this is to divide A into smaller square matrices a(i), then apply a traditional compression to each matrix. The problem is that Huffman-based algorithms are more efficient on big matrices, so you could lose the benefit by choosing the a(i) too small.
Therefore we have to use wiser algorithms, which usually wouldn’t work that well on a classic GPU. My first idea is to use SVD decomposition; here’s a short remark:
then we have the interesting part:
Summarized, it means that starting with a square matrix a(i) of dimension m*m, with m<<n, SVD decomposition is like taking a hypersphere U of dimension m and radius 1, and transforming it, via the matrix SIGMA of the SVD, into a hyperellipsoid V.
The orientation of V is written in U: it means we have a unit vector v of dimension m, centered in the origin, that gives one direction in space, and it is one of the axes of U. We can easily calculate the other axes by finding the j (= m-1) vectors v(j) such that:
<v, v(j)> = 0
We map the vector v through SIGMA and find one vector k, which is one direction (axis) of V. We can find the other j directions in the same way.
So, given a submatrix a(i) of dimension m*m, we can compress it into
a ) a vector of dimension m from the SVD
b ) a vector of dimension m giving one direction in hyperspace
Therefore the original matrix A=n*n is compressed into n/m vectors of dimension 2m, i.e. the whole matrix is compressed into a 2n vector. HUGE HUGE HUGE RESULT!!!
Remember that nowadays the iterative methods for computing the SVD of an A=n*n matrix take linear time to reach an error smaller than the rounding error of an IEEE-compliant processing unit.
So not only do you compress more, you also do it faster.
Moreover this is a lossless algorithm; you can get even better results with lossy variants.
Search google for ‘parallel huffman coding’ and you’ll see some good academic paper results. This one in particular looked like it might work on CUDA:
While most compression is indeed serial (values depend on previous history), you can still do all kinds of tricks to “fix things up” in parallel computes. The simplest dumb splitting can be quite effective: say you have 600MB of data to compress (with Huffman, arithmetic coding, LZW, etc). You can compress 600 streams of 1MB each independently in parallel and join them. The compression will NOT be as effective as a single stream (since any particular block can’t steal references from previous blocks), but in practice, depending on algorithm and block size, the compression inefficiency is not outrageous. And in fact if you do a multi-pass algorithm, you can even fix much of this up by taking a compressed stream and “recompressing” it to use the offsets known from the previous compression… this gets trickier. You can also get more compression ratio by using deeper searching in each independent block (especially for LZW), perhaps enough to offset the loss of efficiency from the lack of history.
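Here’s what the dumb splitting looks like in Python, with zlib standing in for the per-block compressor (function names are mine, just a CPU-side sketch of the idea):

```python
import zlib

def compress_blockwise(data, block_size):
    """Compress fixed-size blocks independently -- each block could go
    to its own GPU thread block. Costs some ratio, because a block
    can't reference history from earlier blocks."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_blockwise(blocks):
    # blocks are independent, so decompression parallelizes the same way
    return b"".join(zlib.decompress(b) for b in blocks)
```

Comparing `sum(len(b) for b in compress_blockwise(data, bs))` against `len(zlib.compress(data))` for a few block sizes shows the ratio penalty the post describes: smaller blocks lose more history, bigger blocks expose less parallelism.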
It’s probably not a bad CUDA contest idea. Not too hard for a simple starting implementation, but the process can be tweaked and tuned forever. CPUs have fat local L2 caches to make the lookup tables fast, though, and I wonder if that would be a bottleneck on the GPU.
hmmm. texture cache?
For arithmetic or Huffman decoding, this is exactly what you’d use.
For LZW (or most adaptive algorithms), unfortunately you need to be able to update the table continuously as you travel the stream… neither decoding nor encoding uses a fixed table like a CUDA texture.
It’s still possible of course, and part of the challenge is how you’d do it. Tiny MRU caches in shared memory? Tons of kernel launches with texture rebinding between them? Use lots of math to represent string fragments by hashes, trading off space for compute? It’s interesting…
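For reference, a minimal CPU-side LZW encoder in Python shows why a read-only table isn’t enough: the dictionary gains an entry on nearly every step, so a fixed texture would only cover the initial 256 single-byte entries (hypothetical sketch, not a production encoder):

```python
def lzw_encode(data):
    """Minimal LZW encoder. The dictionary grows as the stream is
    consumed -- exactly the continuous table update that a fixed
    CUDA texture can't provide."""
    table = {bytes([i]): i for i in range(256)}  # static seed entries
    out, w = [], b""
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                    # extend the current match
        else:
            out.append(table[w])
            table[wc] = len(table)    # dictionary updated mid-stream
            w = bytes([b])
    if w:
        out.append(table[w])
    return out
```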
You could keep the tables in coalesced global mem. It’d work very well if you do even a few (e.g. 10) ops for each read/write. If you can’t do that right away, you can think about the inner loop of the algorithm and how to get a little more data reuse and utilize an smem cache.
Btw, would suddenly having a multi-megabyte MRU help compression efficiency much?
Yeah, everyone agrees a multi-megabyte semi-local memory would just be awesome. For decompression, you probably would have a MRU of the last few K decoded, plus a lookup dictionary (defined by the compression algorithm) which would be dynamically updated. The fast cache would be great for this. Compression gives you more options since you get to choose the exact strategy for building dictionaries and how far back you’ll use stream references. In both cases, you may want both an MRU cache as well as a dictionary (and that dictionary would probably have both static and dynamic entries.)
I would expect that Nvidia will add control over such a layer of memory, since it’s so useful. Intel’s done a surprisingly solid job with Larrabee’s cache/local RAM… it does standard automatic HW caching of both reads and writes to global memory, BUT it also allows streaming of data with fast eviction if you like. It allows you to issue prefetch hints. (All of those features are like CPU caches.) What’s new for Larrabee is it allows interprocessor communication (you can choose to keep all or part of the L2 memory coherent between cores or not, automatically synchronized by the intercore ring buffer). And finally, perhaps most usefully, you can disable even parts of the cache and use it as your own private fast RAM scratchpad. It’s awfully versatile.
NV’s hardware cache is mostly optimized for caching texture reads. It’s uncertain how much hardware level control is possible over this RAM… is it entirely hardware controlled only for read caching? [That’s all that’s exposed to the CUDA programming model.] There’s not much shared with us so we have to guess… even the cache size is unclear, along with lots of details (is the constant cache shared with texture cache? Are fetches done in multi cache lines for 2D textures? Can the cache be disabled to keep certain addresses “locked” without eviction?)
The problem with cache control is that it’s powerful but it’s hard for programmers. The Cell SPE cache is pretty well designed, but it’s just HARD to keep track of this extra “dimension” of control, and even Cell coders struggle. Sort of like in CUDA, we’re doing regular coding, but have to add the extra juggling of the concepts of blocks and warps, coalesced RAM and bank conflicts. Adding yet more complexity in terms of RAM hinting, locking, core synchronization is intimidating. [But I still want it!]
Those are some cool details re Larrabee. Certainly putting a cache on global memory is a great thing, and being able to trade it off with shared mem is also great. It’s strange, though, that you’re allowed to prefetch. On a GPU you’re not supposed to need to prefetch because latency gets hidden automatically, not manually. I hope the one big, coherent cache won’t increase latency so much that it can’t be hidden automatically.
I wonder if/when NVIDIA is planning to put this kind of gmem cache on its GPUs.
Well, an L2 cache hides memory latency for you… instead of 200 clocks to get global memory, you can read and write from the L2 in 20. (or whatever, I just made those numbers up, but it’s clearly an order-of-magnitude thing.)
Larrabee isn’t a GPU. You’re exactly right that GPUs hide latency by having lots of threads in flight. But Larrabee’s CPU-ish core has more complexity, and therefore it’s harder to juggle multiple thread contexts. A G200 SM can have up to 32 warps all in flight at once… a Larrabee core can only have 4 simultaneous threads (each comparable to one CUDA warp).
The two designs are both extremely interesting and have tradeoffs in very different domains.
Luckily I don’t see why the next NV GPU couldn’t add similar read/write cache, almost certainly by just generalizing the texture cache hardware. I wouldn’t even mind if NV skipped the fancy intercommunication used by Larrabee cores to keep their respective L2s coherent with each other (that’s a tricky job for hardware, I’d prefer the KISS design with more power and less elegance)
That cache would need to become big enough to have a positive effect for up to 32 * 32 * 3 = 3072 threads per TPC. I do not know how many floats you normally want to have in the cache per thread, but it adds up to 12 kB per float value. As far as I remember, they have 24 kB now, so that would add up to 2 float values per thread (or better with <100% occupancy)
I wholeheartedly agree with your KISS remark. I find the fact I cannot tweak a lot of things in CUDA a feature. I cannot tweak for me means something less to do wrong ;)
What what wahaaat? Larrabee isn’t massively hyper-threaded? That’s the whole fn point of the architecture! The 21st-century demon of latency slain by time-sliced parallelism. Does it even have warps? Did they make it superscalar, too? Garbaaage lol
That’s tricky. The current global mem is cache-coherent, and also supports atomics. I guess you can get around that by supporting true “global” access only on non-cached portions. Then you’re really back to the texture cache. But yeah, nvidia can certainly expand it and make it easier to use. (ie without any of that texture resource bs.) It could also allow writes to it, even if they’re dangerous. Honestly, a shared, coherent cache would be tricky to implement but not so bad on performance. Its biggest impact would be latency. But the GPU paradigm laughs at latency. (And laughs at Intel. wtf)
Yeah, cache is really tricky in that respect. The biggest bundle of memory on the GPU, by far, is registers. And what are you gonna do, have a cache as big as you have registers? It’s so silly vs what a cache is on a CPU. But sometimes threads do share the same memory. It’d be nice to see the smem reconfigurable from a scratchpad into an automatic cache.
i keep on reading different values… isn’t shared memory, or cache, 16 kb for the 8x and 9x series, and 32 kb for GT200?
shared mem = 16 kb for all architectures.
amount of constant cache = 16 kb per SM as far as I know.
amount of texture cache is, as far as I know, 16 kb per TPC for pre-GT200 and 24 kb for GT200, but that is from a vague memory; I don’t have the programming guide handy, and that one speaks of texture cache per SM (while it actually is per TPC, as far as I understand the hardware).