winrar, winzip or 7zip on GPU like the topic says

g000fy · September 22, 2008, 4:46am

has anyone made any of those work with gpu acceleration?

wouldnt that be sweet?

cbuchner1 · September 22, 2008, 9:18am

Compression is pretty much a serial algorithm with dependencies on previous decisions, I don’t think there is much potential for parallelizing a compression algorithm except maybe for gathering some statistics about the source data.

Algorithms that perform blockwise independent compression might benefit a bit by distributing these blocks to various SMs on the GPU. But still the question remains how to fully occupy the various threads (up to 768 or 1024) within an SM - without causing branch divergence.

jack · September 22, 2008, 11:21am

Don’t some of the compression algorithms use Huffman coding? I don’t know how much CPU time is spent on that part of the compression, but that should be easily parallelized (AFAIK).

cbuchner1 · September 22, 2008, 1:15pm

Huffman code is a variable length code, so the position of each codeword in the final bit stream depends on previous coding decisions. So I can’t see how this would work on the GPU.

E.D_Riedijk · September 22, 2008, 2:15pm

it can work by outputting at the maximum separation and afterwards removing empty space with a scan algorithm. I believe there is even a function in CUDPP.

mascarpone · September 22, 2008, 3:53pm

up to today compression algorithms has been thought to work on a single CISC cpu, but this doesn’t mean that we can’t have better, faster algorithms on GPGPU.

For simplicity, think a file of n^2 bits as a A=n*n square matrix (eventually filled with zeros at the end).

A stupid way to do this is to divide A in smaller a(i) square matrix, then apply a traditional compression to every matrix. The problem is that huffman based algorithms are more efficient on big matrix, so you could lose the beneficial effect by choosing too big a(i) square matrix.

Therefore we have to use wiser algorithms, which usually wouldn’t work that good on a classic GPU. My first idea is to useSVD decomposition, here’s a short remark:

Because U and V are unitary, we know that the columns u1,…,um of U yield an orthonormal basis of Km and the columns v1,…,vn of V yield an orthonormal basis of Kn (with respect to the standard scalar products on these spaces).

The linear transformation T :Kn â†’ Km that takes a vector x to Mx has a particularly simple description with respect to these orthonormal bases: we have T(vi) = Ïƒi ui, for i = 1,…,min(m,n), where Ïƒi is the i-th diagonal entry of Î£, and T(vi) = 0 for i > min(m,n).

The geometric content of the SVD theorem can thus be summarized as follows: for every linear map T :Kn â†’ Km one can find orthonormal bases of Kn and Km such that T maps the i-th basis vector of Kn to a non-negative multiple of the i-th basis vector of Km, and sends the left-over basis vectors to zero. With respect to these bases, the map T is therefore represented by a diagonal matrix with non-negative real diagonal entries.

then we have the interesting part:

To get a more visual flavour of singular values and SVD decomposition â€”at least when working on real vector spacesâ€” consider the sphere S of radius one in Rn. The linear map T maps this sphere onto an ellipsoid in Rm. Non-zero singular values are simply the lengths of the semi-axes of this ellipsoid. Especially when n=m, and all the singular values are distinct and non-zero, the SVD decomposition of the linear map T can be easily analysed as a succession of three consecutive moves : consider the ellipsoid T(S) and specifically its axes ; then consider the directions in Rn sent by T onto these axes. These directions happen to be mutually orthogonal. Apply first an isometry v* sending these directions to the coordinate axes of Rn. On a second move, apply an endomorphism d diagonalized along the coordinate axes and stretching or shrinking in each direction, using the semi-axes lengths of T(S) as stretching coefficients. The composition d o v* then sends the unit-sphere onto an ellipsoid isometric to T(S). To define the third and last move u, just apply an isometry to this ellipsoid so as to carry it over T(S). As can be easily checked, the composition u o d o v* coincides with T.

Summarized it means that starting with a square matrix a(i) of dimension m*m, with m<<n, SVD decomposition is like takin ipersphere U of dimension M and ray 1, and trasform it using the matrix SIGMA of SVD in an iperellipsoid V.

The orientation of V is written in U, it means that we have a versor v of dimension m centered in the origin, that gives one direction in the space, and it is one of the axis of U. We can calculate easly the other axis by finding the j(=m-1) vectors v(j) such as:

<v, v(j)> = 0

We map the vector v using SIGMA and we find one vector k, which is one direction (axis) of V. We can find the other j directions by:

<k, k(j)>=0

So, given a submatrix a(i) of dimension m*m, can compress it in

a ) A vector of dimension m with the SVD

b ) A vector of dimension m with one direction in the iperspace

=2*m

Therefore the original matrix A=n*n is compressed in n/m vectors of dimension 2m, therefore the whole matrix is compressed in a 2n vector. HUGE HUGE HUGE RESULT!!!

mascarpone · September 22, 2008, 3:58pm

Post scriptum:

Remember that nowdays iterative method for the computation of SVD of an A=n*n matrix, takes a linear time to have an error smaller than the rounding error of an IEEE compliant processing unit.

So not only you compress more, but you also do it faster.

Moreover this is a lossyless algorithms, you can have even better resluts with lossy algorithms.

jack · September 22, 2008, 4:42pm

Search google for ‘parallel huffman coding’ and you’ll see some good academic paper results. This one in particular looked like it might work on CUDA:

[url=“http://portal.acm.org/citation.cfm?id=1257297.1257781&coll=GUIDE&dl=GUIDE&CFID=81714693&CFTOKEN=91953531”]http://portal.acm.org/citation.cfm?id=1257...FTOKEN=91953531[/url]

SPWorley · September 22, 2008, 5:42pm

While most compression in indeed serial (values depend on previous history), you can still do all kinds of tricks to “fix things up” in parallel computes. The simplest dumb splitting can be quite effective: say you have 600MB of data to compact (with Huffman, arithmetic coding, LZW, etc). You can compact 600 streams of 1MB each independently in parallel and join them. The compression will NOT be as effective as a single stream (since any particular block can’t steal references from previous blocks) but in practice, depending on algorithm and block size, the compression inefficiency is not outrageous. And in fact if you do a multi-pass algorithm, you can even fix much of this up by taking a compressed stream and “recompressing” it to use the offsets known from the previous compression… this gets trickier. You can also get more compression ratio by using deeper searching in each independent block (especially for LZW), perhaps enough to offset the loss of efficiency by the lack of history.

It’s probably not a bad CUDA contest idea. No too hard for a simple starting implementation, but the process can be tweaked and tuned forever. CPUs have fat local L2 caches to make the lookup tables fast, though, and I wonder if that would be a bottleneck on the GPU.

cbuchner1 · September 22, 2008, 5:51pm

hmmm. texture cache?

SPWorley · September 22, 2008, 6:06pm

For arithmetic or Huffman decoding, this is exactly what you’d use.

For LZW (or most adaptive algorithms), unfortunately you need to be able to update the table continuously as you travel the stream… neither decoding or encoding use a fixed table like a CUDA texture.

It’s still possible of course, and part of the challenge is how you’d do it. Tiny MRU caches in shared memory? Tons of kernel launches with texture rebinding between them? Use lots of math to represent string fragments by hashes, trading off space for compute? It’s interesting…

alex_dubinsky · September 24, 2008, 3:09am

You could keep the tables in coallesced global mem. It’d work very well if you do even a few (eg 10) ops for each read/write. If you can’t do that right away, you can think about the inner loop of the algorithm on how to get a little more data reuse and utilize an smem cache.

Btw, would suddenly having a multi-megabyte MRU help compression efficiency much?

SPWorley · September 24, 2008, 3:48am

Yeah, everyone agrees a multi-megabyte semi-local memory would just be awesome. For decompression, you probably would have a MRU of the last few K decoded, plus a lookup dictionary (defined by the compression algorithm) which would be dynamically updated. The fast cache would be great for this. Compression gives you more options since you get to choose the exact strategy for building dictionaries and how far back you’ll use stream references. In both cases, you may want both an MRU cache as well as a dictionary (and that dictionary would probably have both static and dynamic entries.)

I would expect that Nvidia will add control over such a layer of memory, since it’s so useful. Intel’s done a surprisingly solid job with Larrabee’s cache/local RAM… it does standard automatic HW caching of both reads and writes to global memory, BUT it also allows streaming of data with fast eviction if you like. It allows you to issue prefetch hints. (All of those features are like CPU caches.) What’s new for Larrabee is it allows interprocessor communication (you can choose to keep all or part of the L2 memory coherent between cores or not, automatically synchronized by the intercore ring buffer). And finally, perhaps most usefully, you can disable even parts of the cache and use it as your own private fast RAM scratchpad. It’s awfully versatile.

NV’s hardware cache is mostly optimized for caching texture reads. It’s uncertain how much hardware level control is possible over this RAM… is it entirely hardware controlled only for read caching? [That’s all that’s exposed to the CUDA programming model.] There’s not much shared with us so we have to guess… even the cache size is unclear, along with lots of details (is the constant cache shared with texture cache? Are fetches done in multi cache lines for 2D textures? Can the cache be disabled to keep certain addresses “locked” without eviction?)

The problem with cache control is that it’s powerful but it’s hard for programmers. The Cell SPE cache is pretty well designed, but it’s just HARD to keep track of this extra “dimension” of control, and even Cell coders struggle. Sort of like in CUDA, we’re doing regular coding, but have to add the extra juggling of the concepts of blocks and warps, coalesced RAM and bank conflicts. Adding yet more complexity in terms of RAM hinting, locking, core syncronization is intimidating. [But I still want it!]

alex_dubinsky · September 25, 2008, 7:13am

Those are some cool details re Larabee. Certainly putting a cache on global memory is a great thing, and to be able to trade it off with shared mem is also great. That’s strange, though, that you’re allowed to prefetch. On a GPU you’re not supposed to need to prefetch because latency gets hidden automatically, not manually. I hope the one, big, coherent cache won’t increase latency so much that it can’t be hidden automatically.

I wonder if/when NVIDIA is planning to put this kind of gmem cache on its GPUs.

SPWorley · September 25, 2008, 8:28am

Well, an L2 cache hides memory latency for you… instead of 200 clocks to get global memory, you can read and write from the L2 in 20. (or whatever, I just made those numbers up, but it’s clearly an order-of-magnitude thing.)

Larrabee isn’t a GPU. You’re exactly right that GPUs hide latency by having lots of threads in flight. But Larabee’s CPU-ish core has more complexity, and therefore it’s harder to juggle multiple thread contexts. A G200 SM can have up to 32 warps all in flight at once… a Larrabee core can only have 4 simultaneous threads (each comparable to one CUDA warp).

The two designs are both extremely interesting and have tradeoffs in very different domains.

Luckily I don’t see why the next NV GPU couldn’t add similar read/write cache, almost certainly by just generalizing the texture cache hardware. I wouldn’t even mind if NV skipped the fancy intercommunication used by Larrabee cores to keep their respective L2s coherent with each other (that’s a tricky job for hardware, I’d prefer the KISS design with more power and less elegance)

E.D_Riedijk · September 25, 2008, 4:16pm

That cache would need to become big enough to have a positive effect for up to 32 * 32 * 3 = 3072 threads per TPC. I do not know how many floats you normally want to have in the cache per thread, but it adds up to 12 kB per float value. As far as I remember, they have 24 kB now, so that would add up to 2 float values per thread (or better with <100% occupancy)

I wholeheartedly agree with your KISS remark. I find the fact I cannot tweak a lot of things in CUDA a feature. I cannot tweak for me means something less to do wrong ;)

alex_dubinsky · September 25, 2008, 5:24pm

What what wahaaat? Larabee isn’t massively hyper-threaded? That’s the whole fn point of the architecture! The 21st century demon of latency slayed by time-sliced parallelism. Does it even have warps? Did they make it superscalar, too? Garbaaage lol

That’s tricky. The current global mem is cache-coherent, and also supports atomics. I guess you can get around that by supporting true “global” access only on non-cached portions. Then you’re really back to the texture cache. But yeah, nvidia can certainly expand it and make it easier to use. (ie without any of that texture resource bs.) It could also allow writes to it, even if they’re dangerous. Honestly, a shared, coherent cache would be tricky to implement but not so bad on performance. Its biggest impact would be latency. But the GPU paradigm laughs at latency. (And laughs at Intel. wtf)

alex_dubinsky · September 25, 2008, 6:13pm

Yeah, cache is really tricky in that respect. The biggest bundle of memory on the GPU, by far, is registers. And what are you gonna do, have a cache as big as you have registers? It’s so silly vs what a cache is on a CPU. But sometimes threads do share the same memory. It’d be nice to see the smem reconfigurable from a scratchpad into an automatic cache.

mascarpone · September 25, 2008, 8:25pm

i keep on reading different values… Shared memory, or cache, isn’t 16kb for 8x and 9x series, and 32 kb for gt200?

E.D_Riedijk · September 25, 2008, 9:00pm

shared mem = 16 kb for all architectures.
amount of constant cache = 16 kb per SM as far as I know.
amount of texture cache is as far as I know 16 kb per TPC for <GT200, 24 kb for GT200, but that is from a vague memory, I don’t have the programming guide handy, and that one speaks of texture cache per SM (while it actually is per TPC as far as I understand the hardware)

Topic		Replies	Views
Optimizing Data Transfer Using Lossless Compression with NVIDIA nvcomp Technical Blog	5	968	January 11, 2021
Data Compression on GeForce CUDA Programming and Performance	9	9233	May 5, 2012
Starting a compression algorithm powered by CUDA any help is appreciated CUDA Programming and Performance	9	11183	April 9, 2010
Ultimate Data Compression Algorithm PAQ on CUDA CUDA Programming and Performance	8	5529	December 9, 2008
What are you guys doing with cuda? just wanna find a way to go CUDA Programming and Performance	81	56807	February 7, 2013
CUDA 2.1 FAQ Please read before posting CUDA Programming and Performance	10	211178	January 18, 2014
nvCOMP - get compressed data from device GPU-Accelerated Libraries nvcomp	17	1102	April 19, 2024
CUDA + 7zip = OH GOD! seriously! CUDA Programming and Performance	6	18410	December 3, 2008
Can CUDA run a Windows .exe file/program ? CUDA Programming and Performance	3	7284	February 10, 2012
CUDA 1.0 FAQ (OBSOLETE) Frequently asked questions about CUDA Announcements	2	75934	February 9, 2009

winrar, winzip or 7zip on GPU like the topic says

Related topics