Maxwell texture compression under CUDA

One big weakness of the new Maxwell GPUs is their low memory bandwidth due to their narrow memory bus (256 bits at most). To alleviate its impact, memory compression techniques are employed.

The question is whether the memory compression hardware is exposed to CUDA. If so, how can one make use of it?

to what end? to compress (resident) (application) data?

would the hardware not be ‘passive’ or ‘hard-wired’, like ecc circuitry?

Yes, for application data, in order to effectively increase bandwidth, in the same way graphics apps use it via texture mapping.

What is known about this texture compression? Is it lossy or lossless? Is it a transparent hardware feature or under programmer control (is it exposed via an OpenGL extension, for example)?

"One big weakness of the new Maxwell GPUs is their low memory bandwidth due to their short memory bus (256 bits at most). In order to alleviate its impact memory compression techniques are employed.

The question is if the memory compression hardware is exposed to CUDA. If yes how can one make use of it?"

i have not spent much time on the intricacies of maxwell, as i am under the deep impression that the true maxwell workhorses are still on the way, and am thus still preoccupied with the kepler architecture

but, if i follow ekon correctly, ekon is noting that, by compressing the data, fewer bus lines - a narrower memory bus - can transfer the same amount of data as a wider memory bus, give or take
further, ekon seems to point out that maxwell makes use of such a design premise

if this is the case, i would very much be of the impression that this would be embedded circuitry, like ecc is embedded circuitry; likely between device memory and the caches; and not under the programmer’s control
cuda oracle njuffa, you should know best; i am slightly (much) speculating here

the alternative then would be to look at software compression of application data, especially given the greater shared memory of maxwell, and the idle processing capacity in many cases
but i see little use if the data is already hardware compressed - it would imply double compression, and the second compression would yield poor results
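
purely as an illustration of that software alternative (and not of anything the hardware does), here is a minimal sketch, assuming a made-up encoding of one 32-bit base value per block plus an 8-bit delta per element, decoded into shared memory before use; all names and sizes are hypothetical:

```
// Hypothetical software compression: each block of 256 elements is stored as
// one 32-bit base value plus one 8-bit delta per element (4x fewer bytes moved
// for the element data). Each thread block reconstructs its tile in shared
// memory before using it. This only works when the data genuinely fits
// base + small delta; the scheme and names are illustrative, not Maxwell's.
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

__global__ void decompressAndUse(const int32_t *base,   // one base per block
                                 const int8_t  *delta,  // one delta per element
                                 int32_t *out, int n)
{
    __shared__ int32_t vals[256];                        // decompressed tile (blockDim.x == 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        vals[threadIdx.x] = base[blockIdx.x] + delta[i]; // reconstruct in shared memory
    __syncthreads();

    // ... whatever the kernel actually needs the data for; here it is just copied out.
    if (i < n)
        out[i] = vals[threadIdx.x];
}

int main()
{
    const int n = 1024, threads = 256, blocks = n / threads;
    std::vector<int32_t> hBase(blocks, 1000);            // per-block base values
    std::vector<int8_t>  hDelta(n);
    for (int i = 0; i < n; ++i) hDelta[i] = (int8_t)(i % 100);

    int32_t *dBase, *dOut;
    int8_t  *dDelta;
    cudaMalloc(&dBase,  blocks * sizeof(int32_t));
    cudaMalloc(&dDelta, n * sizeof(int8_t));
    cudaMalloc(&dOut,   n * sizeof(int32_t));
    cudaMemcpy(dBase,  hBase.data(),  blocks * sizeof(int32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(dDelta, hDelta.data(), n * sizeof(int8_t),       cudaMemcpyHostToDevice);

    decompressAndUse<<<blocks, threads>>>(dBase, dDelta, dOut, n);
    cudaDeviceSynchronize();

    cudaFree(dBase); cudaFree(dDelta); cudaFree(dOut);
    return 0;
}
```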

My questions regarding the Maxwell texture compression were not rhetorical in nature. My experience with Maxwell is limited to playing with a Maxwell-based GPU for a few hours shortly before I departed NVIDIA. I certainly cannot lay claim to a status of “CUDA oracle” :-)

If the compression is lossless and transparent (e.g. like ECC, as you say), it would appear that there is nothing for a CUDA programmer to do or worry about, as any benefits would be accrued automatically; further discussion could be moot. However, I simply do not know whether the Maxwell compression mentioned by the original poster is of that nature, thus my questions.

In some instances the larger caches on Maxwell may make up for the limited memory bandwidth. But as I have expressed before, the limited memory bandwidth of Maxwell GPUs would appear to exacerbate a problem of CUDA-based apps becoming more and more memory bound in recent years, as growth in memory bandwidth has not kept pace with the amazing growth in FLOPS.

The GTX980 whitepaper refers to “lossless compression techniques”.

One possible implementation I can see is keeping compressed data in device memory and reading it back through hardware decompression. However, for writable device memory, e.g. global memory, compression would be impractical, as stores to compressed data would be terribly slow. The only case I see is hardware decompression applied to read-only textures bound to CUDA arrays, not to linear memory. Copying data to a CUDA array could implicitly compress the data, whereas texture fetching could decompress it. Of course, these are just personal thoughts.
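
For concreteness, here is a minimal sketch of the two binding cases distinguished above: a texture object backed by a CUDA array (opaque layout, where an implicit compression step could conceivably happen on the copy-in) versus one backed by ordinary pitched linear memory. Whether either path actually involves compression is exactly the open question; the code is just plain texture-object usage (CUDA 5.0+) and the names are made up:

```
#include <cuda_runtime.h>

__global__ void fetch(cudaTextureObject_t tex, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);  // texel-centre fetch
}

static cudaTextureObject_t makeTexture(const cudaResourceDesc &resDesc)
{
    cudaTextureDesc texDesc = {};
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode   = cudaReadModeElementType;
    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

int main()
{
    const int w = 1024, h = 1024;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    // Case 1: CUDA array -- opaque layout, read-only from the kernel's point of view.
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaResourceDesc arrRes = {};
    arrRes.resType = cudaResourceTypeArray;
    arrRes.res.array.array = arr;
    cudaTextureObject_t texArr = makeTexture(arrRes);

    // Case 2: pitched linear memory -- the same data remains writable as global memory.
    float *lin;
    size_t pitch;
    cudaMallocPitch(&lin, &pitch, w * sizeof(float), h);
    cudaResourceDesc linRes = {};
    linRes.resType = cudaResourceTypePitch2D;
    linRes.res.pitch2D.devPtr       = lin;
    linRes.res.pitch2D.desc         = desc;
    linRes.res.pitch2D.width        = w;
    linRes.res.pitch2D.height       = h;
    linRes.res.pitch2D.pitchInBytes = pitch;
    cudaTextureObject_t texLin = makeTexture(linRes);

    // (Filling the array / linear buffer with data is omitted for brevity.)
    float *out;
    cudaMalloc(&out, w * h * sizeof(float));
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    fetch<<<grid, block>>>(texArr, out, w, h);
    fetch<<<grid, block>>>(texLin, out, w, h);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(texArr);
    cudaDestroyTextureObject(texLin);
    cudaFreeArray(arr);
    cudaFree(lin);
    cudaFree(out);
    return 0;
}
```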

Would it be beneficial for CUDA apps? I think the answer is dependent on the problem and data nature.

“The GTX980 whitepaper refers to “lossless compression techniques”.”

my thinking is that it absolutely has to be lossless, otherwise one would never again recognize one’s own data; soon enough, what you get out is not what you put in, and in many a case you would actually care about this

“However, utilizing writeable device memory, e.g. global memory, compression would be impractical as memory stores on compressed data would be terribly slow.”

i do not follow the logic; hence, i cannot concur; does ecc slow down memory operations? i doubt it

“The only case I see is by applying hardware decompression on read-only textures bound on CUDA arrays and not on linear memory.”

i do not see how hardware compression would differ between textures and linear memory

Lossy texture compression schemes such as S3TC have certainly been used in OpenGL for many years, thus my question. As long as textures are all about visible pixels and not data for general purpose computation, lossy texture compression can be perfectly acceptable.

ECC as implemented on NVIDIA GPUs does have a negative performance impact on memory bandwidth. The reason is that the ECC information is transported in-band, meaning some of the bandwidth otherwise available to user data is taken up by ECC bits. The performance penalty has been reduced from Fermi to Kepler and again from Kepler to Maxwell. ECC on CPUs is typically implemented as out-of-band by adding one additional bit for every byte, and widening the memory interface accordingly.
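
As a side note, whether ECC is currently enabled on a device can be queried through the runtime API; a minimal sketch (device 0 assumed):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: ECC %s, %d-bit memory bus, %.0f MHz memory clock\n",
           prop.name,
           prop.ECCEnabled ? "enabled" : "disabled",
           prop.memoryBusWidth,
           prop.memoryClockRate / 1000.0);   // memoryClockRate is reported in kHz
    return 0;
}
```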

I find the whitepaper very uninformative as to how the compression actually works. I looked for OpenGL extensions that expose this compression but could not find any. This may imply that this feature is “just there” and not under programmer control or that I don’t know how to search OpenGL extensions :-)

CUDA arrays are typed, so even in the case of lossy compression it could be enabled if the array consists of floats. In the case of lossless compression it could be enabled regardless of the specified type.
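
Concretely, that typing is carried by the channel format descriptor, which records the element kind and per-channel bit widths; a small illustrative snippet:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // The descriptor tells the runtime whether an array holds e.g. 32-bit floats
    // or four 8-bit unsigned colour channels.
    cudaChannelFormatDesc f  = cudaCreateChannelDesc<float>();
    cudaChannelFormatDesc u4 = cudaCreateChannelDesc<uchar4>();

    printf("float : x=%d y=%d z=%d w=%d kind=%d\n", f.x,  f.y,  f.z,  f.w,  (int)f.f);
    printf("uchar4: x=%d y=%d z=%d w=%d kind=%d\n", u4.x, u4.y, u4.z, u4.w, (int)u4.f);
    return 0;
}
```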

indeed, you and c.o. njuffa are awfully right
njuffa brilliantly makes the distinction between all/ general purpose data, and texture/ pixel data
for a fleeting moment i could see compression being applied to all data, just as host memory and other system buses started moving from lanes to ‘channels’ to reduce crosstalk
but a more careful study of the whitepaper seems to suggest that it is mostly confined to texture data

given the data characteristics of pixels/ textures, the compression mechanisms they reference and thus use make perfect sense, and i can understand why it is lossless - it is ‘cheap’ and ‘quick’, with a guaranteed bang

an interesting thing is that they equally exploit (colour) layers, and this has me thinking
in many a case, i have kernels reading multiple input arrays; perhaps these too can be layered and so compressed

yes, greed: that distinct frown forming on one’s face, when you realize your device is running at half capacity, and it is due to (coalesced) global memory access…

I benchmarked texture fetches bound to CUDA arrays. Unfortunately, I didn’t observe any effective-bandwidth measurements that would imply implicit compression of the data. Texture delta compression seems to be either not enabled for CUDA kernels or implemented in software when needed by a graphics API (e.g. D3D).
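
For reference, here is a minimal sketch of the kind of effective-bandwidth measurement described above, streaming a 2D CUDA array through the texture path with constant (i.e. maximally compressible) data; sizes and names are arbitrary, and the printed figure would simply be compared against the device’s theoretical peak:

```
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void streamTexture(cudaTextureObject_t tex, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

int main()
{
    const int w = 8192, h = 8192;
    std::vector<float> host((size_t)w * h, 1.0f);          // constant, i.e. maximally compressible

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, host.data(), w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;
    cudaTextureDesc texDesc = {};
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode   = cudaReadModeElementType;
    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    float *out;
    cudaMalloc(&out, (size_t)w * h * sizeof(float));

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    streamTexture<<<grid, block>>>(tex, out, w, h);          // warm-up run

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    streamTexture<<<grid, block>>>(tex, out, w, h);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read through the texture path plus one write to global memory.
    double bytes = 2.0 * w * (double)h * sizeof(float);
    printf("effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(out);
    return 0;
}
```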

a)

“or implemented in software”

i am still very much of the opinion that the compression is hardware implemented, as opposed to software implemented, in order to reduce the effect on memory latency
a hardware implementation may imply a few clocks, whereas a software implementation may imply a number of instructions, in turn implying many more clocks
a software implementation would likely also imply dedicated, additional processors/ processing to be dropped somewhere

and note this excerpt from the whitepaper:

“The bandwidth savings from this compression is realized a second time when clients such as the Texture Unit later read the data.”

you would then need software processing close to such texture units as well

b)

not all data may be compressible

again from the whitepaper:

“As illustrated in the preceding figure, our compression engine has multiple layers of compression algorithms. Any block going out to memory will first be examined to see if 4x2 pixel regions within the block are constant, in which case the data will be compressed 8:1 (i.e., from 256B to 32B of data, for 32b color). If that fails, but 2x2 pixel regions are constant, we will compress the data 4:1.

In this mode, we calculate the difference between each pixel in the block and its neighbor, and then try to pack these different values together using the minimum number of bits. For example if pixel A’s red value is 253 (8 bits) and pixel B’s red value is 250 (also 8 bits), the difference is 3, which can be represented in only 2 bits.

Finally, if the block cannot be compressed in any of these modes, then the GPU will write out data uncompressed, preserving the lossless rendering requirement.”

evidently, to be compressible, the data must satisfy some of the above conditions/ stipulations
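
for illustration only, a rough host-side sketch of the block classification the excerpt describes, handy to eyeball how compressible one’s own data might be; the block geometry (8x8 pixels of 32-bit colour = 256 B) and the traversal order are assumptions, since the actual hardware details are not documented:

```
// Rough host-side approximation of the whitepaper's block classification
// (assumed geometry: 8x8 pixels of 32-bit colour = 256 B per block).
#include <cstdio>
#include <cstdint>
#include <cstdlib>
#include <algorithm>

// Returns 8 for the 8:1 constant mode, 4 for the 4:1 mode, else 1; for the
// delta mode, 'deltaBits' receives the bits needed for the largest difference.
int classifyBlock(const uint32_t px[8][8], int *deltaBits)
{
    // 8:1 mode: every 4x2 region is a single constant colour.
    bool const4x2 = true;
    for (int y = 0; y < 8 && const4x2; y += 2)
        for (int x = 0; x < 8 && const4x2; x += 4)
            for (int dy = 0; dy < 2 && const4x2; ++dy)
                for (int dx = 0; dx < 4 && const4x2; ++dx)
                    const4x2 = (px[y + dy][x + dx] == px[y][x]);
    if (const4x2) return 8;

    // 4:1 mode: every 2x2 region is a single constant colour.
    bool const2x2 = true;
    for (int y = 0; y < 8 && const2x2; y += 2)
        for (int x = 0; x < 8 && const2x2; x += 2)
            const2x2 = (px[y][x] == px[y][x + 1] &&
                        px[y][x] == px[y + 1][x] &&
                        px[y][x] == px[y + 1][x + 1]);
    if (const2x2) return 4;

    // Delta mode: bits needed for the largest per-channel difference between
    // horizontally adjacent pixels (per the 253 vs. 250 example: delta 3 -> 2 bits).
    int maxDelta = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 1; x < 8; ++x)
            for (int c = 0; c < 4; ++c) {
                int a = (px[y][x - 1] >> (8 * c)) & 0xFF;
                int b = (px[y][x]     >> (8 * c)) & 0xFF;
                maxDelta = std::max(maxDelta, std::abs(a - b));
            }
    int bits = 0;
    while ((1 << bits) <= maxDelta) ++bits;
    *deltaBits = bits;
    return 1;                       // compression ratio then depends on 'bits', if any
}

int main()
{
    const uint32_t block[8][8] = {};        // all-zero block: fully constant
    int bits = 0;
    printf("ratio %d:1, delta bits %d\n", classifyBlock(block, &bits), bits);
    return 0;
}
```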

c)

cuda texture fetches would go through the texture cache, and then there are the sm texture units:

“The GPU’s dedicated hardware Texture units are a valuable resource for compute programs with a need to sample or filter image data.” (kepler whitepaper)

i do not know whether compression would be enabled for these paths as well; i suppose it might be: if textures are compressed before being pushed into memory, then any read of such data must first decompress it
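
and for completeness, a small sketch contrasting the two read paths mentioned in (c): a plain global load versus a load routed through the read-only/ texture cache with __ldg() (available from sm_35, so on maxwell as well); whether the compression discussed here applies to either path is unknown - this only shows how compute code reaches the texture units’ cache at all:

```
#include <cuda_runtime.h>

__global__ void plainLoad(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 // regular global-memory load
}

__global__ void readOnlyLoad(const float * __restrict__ in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);         // load via the read-only (texture) cache
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    plainLoad<<<(n + 255) / 256, 256>>>(in, out, n);
    readOnlyLoad<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```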