Hello.
I would like to profile compute data compression on the A100.
When I run the CUDA sample on compressible memory (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/cudaCompressibleMemory), I get the results below.
I tested this on an A100-PCIE-40GB with CUDA 12.1.
As you can see, there is no significant improvement, and there is no traffic to/from the L2 compression block in the NCU memory chart either.
So I wonder whether there is any way to test this feature.
Even a small answer would be of great help to me.
Thanks in advance.
GPU Device 0: "Ampere" with compute capability 8.0
Generic memory compression support is available
allocating non-compressible Z buffer
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 216 blocks x 1024 threads = 0.387 ms 1.300 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 216 blocks x 1024 threads = 0.395 ms 1.276 TB/s
Running saxpy with 108 blocks x 1024 threads = 0.308 ms 1.633 TB/s
Running saxpy with 216 blocks x 1024 threads = 0.301 ms 1.672 TB/s
Running saxpy with 4000 blocks x 1024 threads = 0.301 ms 1.672 TB/s
Running saxpy_single with 10240 blocks x 1024 threads = 0.310 ms 1.622 TB/s
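For reference, here is how the sample obtains a compressible allocation, reduced to a minimal sketch using the CUDA driver API's virtual memory management functions. This is my own abbreviation of the sample's approach, with error handling omitted; the buffer size just mirrors the runs above. Step 4 is one way to double-check that the driver actually honored the compressible flag rather than silently falling back.

```cuda
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // 1) Check that the device supports generic memory compression.
    int supported = 0;
    cuDeviceGetAttribute(&supported,
        CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED, dev);
    if (!supported) { printf("compression not supported\n"); return 1; }

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // 2) Create physical memory with the compressible flag set.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop,
        CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = 167772160;                       // same size as the sample run
    size = ((size + gran - 1) / gran) * gran;      // round up to granularity

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);

    // 3) Reserve a VA range, map the handle, and grant access as usual.
    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);

    // 4) Verify the allocation really came back compressible.
    CUmemAllocationProp got = {};
    cuMemGetAllocationPropertiesFromHandle(&got, handle);
    printf("compressionType = %d\n", (int)got.allocFlags.compressionType);

    cuMemUnmap(ptr, size);
    cuMemAddressFree(ptr, size);
    cuMemRelease(handle);
    cuCtxDestroy(ctx);
    return 0;
}
```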
On an RTX 4090 compression works fine (I did not check the NCU report):
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 256 blocks x 768 threads = 0.104 ms 4.819 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 256 blocks x 768 threads = 0.585 ms 0.861 TB/s
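To see the L2 compression traffic in the profiler, collecting the memory workload chart for the saxpy kernel should be enough. Something like the following, assuming a recent Nsight Compute and the sample's default binary name (adjust the section name and kernel filter for your version):

```shell
# Collect the memory chart (which includes the L2 compression block)
# for kernels whose name contains "saxpy".
ncu --section MemoryWorkloadAnalysis_Chart -k saxpy ./cudaCompressibleMemory
```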
Thanks for your reply.
I got the NCU report shown in the figure below when I profiled on an RTX 4090.
That is, compression does seem to work well on the RTX 4090.
But I'm not sure why the L2 compression input is 161.11 MB.
Do you know if there is any more detailed documentation about data compression?