Hello, I have questions about L2 compression metrics.
When I profile the following sample code on an RTX 4090, I get a report like the figure below.
For box 1, does this value mean the amount of data, out of the 335.54 MB of L1 → L2 writes, that can be compressed?
For box 2, why is it 0.00 B? Because of this, the compression ratio is displayed as inf. How should I interpret this?
Is it correct that the L2 Compression block on the memory chart operates at the time data is written to L2? In other words, is data compressed and stored in L2 even when it is not evicted to device memory?
Does the L2 cache of the RTX 4090 use a write-back + write-allocate policy?
Thank you in advance.
This seems like a bug. Can you share the report so we can investigate it further?
For 3 - the data is compressed in L2 regardless of whether it's evicted to device memory.
For 4 - can you clarify what you mean here?
Sure. Here is the ncu report.
Please refer to the result of ‘0 - 1198 -init’ kernel.
Regarding question 4, I am curious which cache policy the RTX 4090 uses.
(e.g. write-back + write allocate / write-back + no write allocate)
ncu_raw.ncu-rep (476.7 KB)
Additionally, I am curious about whether there is a minimum granularity for enabling compression.
Looking at the GTC slides (https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21819-optimizing-applications-for-nvidia-ampere-gpu-architecture.pdf, page 30), it looks like data can be compressed when 256 B (8 sectors) are written to the L2 cache.
But in my tests, when less than 8 MB of write traffic reaches the L2 cache, the L2 Compression block on the NCU report shows no input.
So it seems there is a minimum size required for compression to run. Is that correct?
Thanks for your help.
Thanks for sharing the report. I will file an internal ticket and get back to you when I have more information. For the cache and compression questions, I'm not sure, but you may have better luck asking on the CUDA Programming and Performance forum.
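For anyone following along: compressible memory has to be requested explicitly through the CUDA driver virtual memory API, since regular cudaMalloc allocations are not compressible. Below is a minimal sketch (no error checking, device 0 assumed, allocation size rounded to the minimum granularity); it requires a GPU with generic compression support, so treat it as illustrative rather than a drop-in test:

```cuda
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Check that the device supports generic memory compression.
    int supported = 0;
    cuDeviceGetAttribute(&supported,
        CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED, dev);
    if (!supported) { printf("compression not supported\n"); return 1; }

    // Describe a device-pinned allocation with generic compression enabled.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    // Allocation size must be a multiple of the granularity.
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
        CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = granularity;

    // Create the physical allocation, reserve a VA range, map, and set access.
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);
    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);

    // ... launch kernels writing to ptr; this range is eligible for L2 compression ...

    cuMemUnmap(ptr, size);
    cuMemAddressRelease(ptr, size);
    cuMemRelease(handle);
    return 0;
}
```

With such an allocation in place, the L2 Compression metrics in the report can be collected with the full section set, e.g. `ncu --set full -o report ./app`.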