How to use L2 compression? How to send L1D to shared memory?


Hi! I see in this screenshot that some data transfers are 0, so I am wondering what they are and how I can optimize them. Thanks!

A general overview is given here, starting on slide 29.

There is sample code.

There is a section in the programming guide.
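For quick reference, below is a minimal sketch of the allocation path that sample takes, using the driver API's virtual memory management. Error checking is omitted, the helper name is mine, and a CUDA context is assumed to be current.

#include <cuda.h>

// Allocate a buffer backed by compressible memory (the approach used by the
// cudaCompressibleMemory sample).
CUdeviceptr allocCompressible(size_t size, CUdevice dev)
{
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    // Request generic compression for this allocation.
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    // Round the size up to the minimum allocation granularity.
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + granularity - 1) / granularity) * granularity;

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);
    cuMemMap(ptr, size, 0, handle, 0);
    cuMemRelease(handle); // the mapping keeps the allocation alive

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);
    return ptr;
}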


Wow, that's really useful and interesting!

Just one more question left: is it possible to transfer data from L1D to shared memory? Can I … somehow manually make use of that?

Thank you!!!

I tried your sample code, but Nsight Compute still shows no L2 compression. Why?

(base) a100-01% ./cudaCompressibleMemory
GPU Device 0: "Ampere" with compute capability 8.0
Generic memory compression support is available
allocating non-compressible Z buffer
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 216 blocks x 1024 threads = 0.378 ms 1.332 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 216 blocks x 1024 threads = 0.398 ms 1.266 TB/s



output-file-full.nsight-cuprof-report.zip (394.4 KB)

I’m not sure what L1D is. If you are referring to a cache, you cannot explicitly transfer data from a cache anywhere. You must target a location in global (or local) space, and if that data happens to be in a cache, then it will be used from the cache.
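In other words, the supported pattern is to copy from (logical) global space into shared memory yourself; if the data happens to be resident in a cache, the load is simply served from there. Here is a minimal sketch of that pattern, with an illustrative kernel name and tile size. On SM 8.0+ the asynchronous copy maps onto cp.async, which moves the data into shared memory without staging it through registers.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Stage a 256-element tile of global memory into shared memory, then use it.
// Assumes n is a multiple of the tile size, for brevity.
__global__ void stageTile(const float *gmem, float *out, int n)
{
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Asynchronous global -> shared copy; on SM 8.0+ this uses cp.async.
    cg::memcpy_async(block, tile, gmem + blockIdx.x * 256,
                     sizeof(float) * 256);
    cg::wait(block); // wait for the copy to complete

    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tile[threadIdx.x];
}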

I don't know why; I haven't studied it personally, but others have reported a similar observation, see here for example. This may also be of interest, where they seem to have success with the sample code.


I wonder if this metric is only active on SM 9.0?

Looking at the Hopper Tuning Guide:

"The NVIDIA Hopper architecture allows CUDA compute kernels to benefit from the new inline compression (ILC). "

In the sample code linked above, there is the comment:

// On SM 8.0 and 8.6 GPUs compressible buffer can only be initialized 
// through cudaMemcpy.

So it looks like hardware compression is only active on Ampere during host->device transfers, which would not pass through the L2, unlike Hopper, where compression would be active over the life of the kernel.
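For what it's worth, whether a device reports generic compression support at all can be queried at runtime; this is the same attribute the sample checks before printing "Generic memory compression support is available". A small sketch (the helper name is mine):

#include <cuda.h>

// Returns nonzero if the device reports generic memory compression support.
int compressionSupported(CUdevice dev)
{
    int supported = 0;
    cuDeviceGetAttribute(&supported,
                         CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED,
                         dev);
    return supported;
}

Note that this only says the hardware supports compression, not when (or whether) it is actually applied.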

I've never understood that to be true. I'm reasonably confident that I've run the experiment in the past where I do a small host->device transfer (a data size that fits within the L2 footprint), then run a kernel that uses that data, and observe 100% hits in the L2.
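For anyone who wants to repeat it, a sketch of that experiment might look like the following (the sizes and kernel are illustrative; profile the kernel launch with Nsight Compute and look at the L2 hit rate, e.g. lts__t_sector_hit_rate.pct):

#include <cuda_runtime.h>

__global__ void readAll(const float *x, float *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, x[i]); // force every element to actually be read
}

int main()
{
    // 4 MiB of data: comfortably within the 40 MB L2 of an A100.
    const int n = 1 << 20;
    float *x, *sum;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&sum, sizeof(float));
    cudaMemset(sum, 0, sizeof(float));

    float *h = new float[n]();
    cudaMemcpy(x, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Profile this launch: if the copy populated the L2, hits should be ~100%.
    readAll<<<(n + 255) / 256, 256>>>(x, sum, n);
    cudaDeviceSynchronize();

    delete[] h;
    cudaFree(x);
    cudaFree(sum);
    return 0;
}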

Have you witnessed a counter-example?


No, it was a somewhat ignorant comment on my part.

It seems odd that, when this was raised on the Nsight Compute forum, there was no explanation. It would be interesting to see the sample run on Hopper (and on Ada too; I could find no mention of memory compression in the documentation there).


I checked your link; that case also cannot see an L2 compression rate. I'm starting to suspect this is an internal problem in Nsight Compute…