How to use L2 compression? How to send L1D to shared memory?


Hi! I see in this screenshot that some data transfers are 0, so I am wondering what they are and how I can optimize them. Thanks!

A general overview is given here, starting on slide 29.

There is sample code.

There is a section in the programming guide.
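For quick reference, below is a minimal sketch of the allocation path that sample takes, using the driver API's virtual memory management. Error checking is omitted, the helper name is mine, and a CUDA context is assumed to be current.

#include <cuda.h>

// Allocate a buffer backed by compressible memory (the approach used by the
// cudaCompressibleMemory sample).
CUdeviceptr allocCompressible(size_t size, CUdevice dev)
{
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    // Request generic compression for this allocation.
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    // Round the size up to the minimum allocation granularity.
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + granularity - 1) / granularity) * granularity;

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);
    cuMemMap(ptr, size, 0, handle, 0);
    cuMemRelease(handle); // the mapping keeps the allocation alive

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);
    return ptr;
}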


Wow, that's really useful and interesting!

Just one more question left: is it possible to transfer data from L1D to shared memory? Can I … somehow manually make use of that?

Thank you!!!

I tried your sample code, but Nsight Compute still shows no L2 compression. Why?

(base) a100-01% ./cudaCompressibleMemory
GPU Device 0: "Ampere" with compute capability 8.0
Generic memory compression support is available
allocating non-compressible Z buffer
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 216 blocks x 1024 threads = 0.378 ms 1.332 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 216 blocks x 1024 threads = 0.398 ms 1.266 TB/s



output-file-full.nsight-cuprof-report.zip (394.4 KB)

I’m not sure what L1D is. If you are referring to a cache, you cannot explicitly transfer data from a cache anywhere. You must target a location in global (or local) space, and if that data happens to be in a cache, then it will be used from the cache.
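In other words, the supported pattern is to copy from (logical) global space into shared memory yourself; if the data happens to be resident in a cache, the load is simply served from there. Here is a minimal sketch of that pattern, with an illustrative kernel name and tile size. On SM 8.0+ the asynchronous copy maps onto cp.async, which moves the data into shared memory without staging it through registers.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Stage a 256-element tile of global memory into shared memory, then use it.
// Assumes n is a multiple of the tile size, for brevity.
__global__ void stageTile(const float *gmem, float *out, int n)
{
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Asynchronous global -> shared copy; on SM 8.0+ this uses cp.async.
    cg::memcpy_async(block, tile, gmem + blockIdx.x * 256,
                     sizeof(float) * 256);
    cg::wait(block); // wait for the copy to complete

    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tile[threadIdx.x];
}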

I don't know why; I haven't studied it personally, but others have reported a similar observation, see here for example. This may also be of interest, where they seem to have success with the sample code.


I wonder if this metric is only active on SM 9.0?

Looking at the Hopper Tuning Guide:

"The NVIDIA Hopper architecture allows CUDA compute kernels to benefit from the new inline compression (ILC). "

In the sample code linked above, there is the comment:

// On SM 8.0 and 8.6 GPUs compressible buffer can only be initialized 
// through cudaMemcpy.

So it looks like hardware compression is only active on Ampere during host->device transfers, which would not pass through the L2, unlike Hopper, where compression would be active over the life of the kernel.
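For what it's worth, whether a device reports generic compression support at all can be queried at runtime; this is the same attribute the sample checks before printing "Generic memory compression support is available". A small sketch (the helper name is mine):

#include <cuda.h>

// Returns nonzero if the device reports generic memory compression support.
int compressionSupported(CUdevice dev)
{
    int supported = 0;
    cuDeviceGetAttribute(&supported,
                         CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED,
                         dev);
    return supported;
}

Note that this only says the hardware supports compression, not when (or whether) it is actually applied.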

I've never understood that to be true. I'm reasonably confident that I've run the experiment in the past where I do a small host->device transfer (a data size that fits within the L2 footprint), then run a kernel that uses that data, and observe 100% hits in the L2.
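For anyone who wants to repeat it, a sketch of that experiment might look like the following (the sizes and kernel are illustrative; profile the kernel launch with Nsight Compute and look at the L2 hit rate, e.g. lts__t_sector_hit_rate.pct):

#include <cuda_runtime.h>

__global__ void readAll(const float *x, float *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, x[i]); // force every element to actually be read
}

int main()
{
    // 4 MiB of data: comfortably within the 40 MB L2 of an A100.
    const int n = 1 << 20;
    float *x, *sum;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&sum, sizeof(float));
    cudaMemset(sum, 0, sizeof(float));

    float *h = new float[n]();
    cudaMemcpy(x, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Profile this launch: if the copy populated the L2, hits should be ~100%.
    readAll<<<(n + 255) / 256, 256>>>(x, sum, n);
    cudaDeviceSynchronize();

    delete[] h;
    cudaFree(x);
    cudaFree(sum);
    return 0;
}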

Have you witnessed a counter-example?


No, it was a somewhat ignorant comment on my part.

It seems odd that, when this was raised on the Nsight Compute forum, there was no explanation. It would be interesting to see the sample run on Hopper (and on Ada too; I could find no mention of memory compression in the documentation there).


I checked your link; that case also cannot see an L2 compression rate. I'm starting to suspect this is an internal problem in Nsight Compute…