Hello.
I would like to profile compute data compression on the A100.
When I run the CUDA sample on compressible memory (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/cudaCompressibleMemory), I get the results below.
I tested this on an A100-PCIE-40GB with CUDA 12.1.
As you can see, there is no significant improvement, and there is no traffic to/from the L2 compression block in the NCU memory chart either.
So I wonder whether there is any way to test this feature.
Even a small answer would be of great help to me.
Thanks in advance.
GPU Device 0: "Ampere" with compute capability 8.0
Generic memory compression support is available
allocating non-compressible Z buffer
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 216 blocks x 1024 threads = 0.387 ms 1.300 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 216 blocks x 1024 threads = 0.395 ms 1.276 TB/s
Running saxpy with 108 blocks x 1024 threads = 0.308 ms 1.633 TB/s
Running saxpy with 216 blocks x 1024 threads = 0.301 ms 1.672 TB/s
Running saxpy with 4000 blocks x 1024 threads = 0.301 ms 1.672 TB/s
Running saxpy_single with 10240 blocks x 1024 threads = 0.310 ms 1.622 TB/s
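For reference, here is how the sample obtains a compressible allocation, reduced to a minimal sketch using the CUDA driver API's virtual memory management functions. This is my own abbreviation of the sample's approach, with error handling omitted; the buffer size just mirrors the runs above. Step 4 is one way to double-check that the driver actually honored the compressible flag rather than silently falling back.

```cuda
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // 1) Check that the device supports generic memory compression.
    int supported = 0;
    cuDeviceGetAttribute(&supported,
        CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED, dev);
    if (!supported) { printf("compression not supported\n"); return 1; }

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // 2) Create physical memory with the compressible flag set.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop,
        CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = 167772160;                       // same size as the sample run
    size = ((size + gran - 1) / gran) * gran;      // round up to granularity

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);

    // 3) Reserve a VA range, map the handle, and grant access as usual.
    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);

    // 4) Verify the allocation really came back compressible.
    CUmemAllocationProp got = {};
    cuMemGetAllocationPropertiesFromHandle(&got, handle);
    printf("compressionType = %d\n", (int)got.allocFlags.compressionType);

    cuMemUnmap(ptr, size);
    cuMemAddressFree(ptr, size);
    cuMemRelease(handle);
    cuCtxDestroy(ctx);
    return 0;
}
```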
On an RTX 4090 compression works fine (I did not check the NCU report):
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 256 blocks x 768 threads = 0.104 ms 4.819 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 256 blocks x 768 threads = 0.585 ms 0.861 TB/s
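To see the L2 compression traffic in the profiler, collecting the memory workload chart for the saxpy kernel should be enough. Something like the following, assuming a recent Nsight Compute and the sample's default binary name (adjust the section name and kernel filter for your version):

```shell
# Collect the memory chart (which includes the L2 compression block)
# for kernels whose name contains "saxpy".
ncu --section MemoryWorkloadAnalysis_Chart -k saxpy ./cudaCompressibleMemory
```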
Thanks for your reply.
I got the NCU report shown in the figure below when I profiled on an RTX 4090.
That is, compression does seem to work well on the RTX 4090.
But I'm not sure why the L2 compression input is 161.11 MB.
Do you know if there is any more detailed documentation about data compression?