a)
“or implemented in software”
i am still very much of the opinion that the compression is implemented in hardware rather than in software, in order to limit its effect on memory latency
a hardware implementation may cost a few clocks; a software implementation implies a number of instructions, which in turn implies many more clocks
a software implementation would likely also need dedicated, additional processors (or processing capacity) placed somewhere
and note this excerpt from the whitepaper:
“The bandwidth savings from this compression is realized a second time when clients such as the Texture Unit later read the data.”
with a software implementation, you would then need software processing close to such texture units as well, to decompress the data on read
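to put rough numbers on "realized a second time", here is a minimal sketch of the traffic for one block; the function name and the one-write-plus-n-reads model are my own simplification, not something the whitepaper states:

```python
def traffic_bytes(block_bytes, ratio, reads):
    """bytes moved over the memory bus for one write plus `reads` later reads.

    assumption: a block compressed at `ratio`:1 is stored, and later read
    back, at block_bytes / ratio bytes.
    """
    stored = block_bytes // ratio
    return stored * (1 + reads)


uncompressed = traffic_bytes(256, 1, reads=1)  # write 256B, read 256B
compressed = traffic_bytes(256, 8, reads=1)    # write 32B, read 32B
print(uncompressed, compressed)
```

for a 256B block at 8:1, that is 64B moved instead of 512B, with half of the saving on the write and half on the later read.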
b)
not all data may be compressible
again from the whitepaper:
“As illustrated in the preceding figure, our compression engine has multiple layers of compression algorithms. Any block going out to memory will first be examined to see if 4x2 pixel regions within the block are constant, in which case the data will be compressed 8:1 (i.e., from 256B to 32B of data, for 32b color). If that fails, but 2x2 pixel regions are constant, we will compress the data 4:1.
In this mode, we calculate the difference between each pixel in the block and its neighbor, and then try to pack these different values together using the minimum number of bits. For example if pixel A’s red value is 253 (8 bits) and pixel B’s red value is 250 (also 8 bits), the difference is 3, which can be represented in only 2 bits.
Finally, if the block cannot be compressed in any of these modes, then the GPU will write out data uncompressed, preserving the lossless rendering requirement.”
evidently, to be compressible, the data must satisfy at least one of the above conditions; otherwise it is written out uncompressed
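the layered checks in the quote can be sketched roughly as follows. this is a toy model, not nvidia's actual format: the 8x8 block layout, the choice of neighbor (left pixel, or the pixel above in the first column), the fixed-width signed delta packing and all function names are my assumptions. only the order of the modes, the 8:1 and 4:1 ratios and the 253/250 example come from the whitepaper.

```python
# toy model of the layered mode selection; not the real hardware format
BLOCK_W, BLOCK_H = 8, 8              # assumption: 8x8 pixels x 4 bytes = 256B
RAW_BYTES = BLOCK_W * BLOCK_H * 4


def regions_constant(block, rw, rh):
    """return True if every rw x rh pixel region holds a single pixel value."""
    for y0 in range(0, BLOCK_H, rh):
        for x0 in range(0, BLOCK_W, rw):
            first = block[y0][x0]
            for y in range(y0, y0 + rh):
                for x in range(x0, x0 + rw):
                    if block[y][x] != first:
                        return False
    return True


def delta_bits(a, b):
    """bits needed for the magnitude of the difference of two channel values."""
    return abs(a - b).bit_length()


def delta_bytes(block):
    """assumed packing: one raw anchor pixel, then fixed-width signed deltas."""
    width = 0
    for y in range(BLOCK_H):
        for x in range(BLOCK_W):
            if x == 0 and y == 0:
                continue  # anchor pixel, stored raw
            # neighbor: the pixel to the left, or the pixel above in column 0
            nb = block[y][x - 1] if x > 0 else block[y - 1][x]
            for c in range(4):
                width = max(width, delta_bits(block[y][x][c], nb[c]))
    width += 1  # one sign bit per delta
    total_bits = 32 + (BLOCK_W * BLOCK_H - 1) * 4 * width
    return (total_bits + 7) // 8


def choose_mode(block):
    """try the modes in the order the whitepaper lists them."""
    if regions_constant(block, 4, 2):
        return "8:1", RAW_BYTES // 8
    if regions_constant(block, 2, 2):
        return "4:1", RAW_BYTES // 4
    packed = delta_bytes(block)
    if packed < RAW_BYTES:
        return "delta", packed
    return "uncompressed", RAW_BYTES


def make_block(pixel_at):
    """build a block of (r, g, b, a) tuples from a function of (x, y)."""
    return [[pixel_at(x, y) for x in range(BLOCK_W)] for y in range(BLOCK_H)]
```

under these assumptions, a flat-colored block takes the 8:1 path, a smooth gradient falls through to the delta path, and a black/white checkerboard fails every mode and is written out uncompressed, which is the "not all data may be compressible" case.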
c)
cuda texture fetches would go through the texture cache, and then there are the sm texture units:
“The GPU’s dedicated hardware Texture units are a valuable resource for compute programs with a need to sample or filter image data.” (kepler whitepaper)
i do not know whether compression would be enabled for these paths as well; i suppose it might be: if textures are compressed before being written to memory, then any read of that data, whichever unit issues it, must decompress it first
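the supposition can be illustrated with a toy model: once a block is stored in compressed form, every reader has to go through the same decompression step to get the original pixels back. the `ToyMemory` class and its single "constant block" mode are hypothetical and purely illustrative; this says nothing about what the hardware actually does on the texture path.

```python
class ToyMemory:
    """hypothetical memory that stores constant blocks in compressed form."""

    def __init__(self):
        self.blocks = {}

    def write(self, addr, pixels):
        # compress only the trivial case: every pixel in the block is equal
        if all(p == pixels[0] for p in pixels):
            self.blocks[addr] = ("const", pixels[0], len(pixels))
        else:
            self.blocks[addr] = ("raw", list(pixels))

    def read(self, addr):
        # any client (rop, texture unit, ...) must expand the stored form
        entry = self.blocks[addr]
        if entry[0] == "const":
            _, value, count = entry
            return [value] * count
        return list(entry[1])


mem = ToyMemory()
mem.write(0, [7] * 64)          # stored compressed
mem.write(1, list(range(64)))   # stored raw
```

a reader that skipped `read()` and looked at the stored entry for block 0 directly would see a tag, one value and a count rather than 64 pixels, which is why a compressed write implies a decompressing read.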