Calculation of Memory Bound nature vs Roofline numbers

tanvishr1197 · March 27, 2023, 2:47pm

Hi,

I ran the following using the cublasSgemm() function from the cuBLAS library, with no shared memory optimization:

Kernel name: volta_sgemm_32x32_sliced1x4nn
Input matrix size: m=n=k=128
Machine name: RTX 2080Ti

The hierarchical roofline in NCU for this kernel shows that it is not memory bound at any level of the memory hierarchy. However, if I do a rough calculation for, say, DRAM (since its peak BW is available online):

Min data reuse or op/B required for workload to be not memory bound
= Peak FLOPs/Peak BW
= 14.2TFLOPS/616GB/s
=~ 23 FLOPs/B
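
The ridge-point arithmetic above can be reproduced with a short script. This is only a sketch: the peak figures are the ones quoted in the post, and the function name `ridge_point` is made up for illustration.

```python
# Ridge point of a roofline: the arithmetic intensity (FLOP/byte) at which
# the compute roof and the memory-bandwidth roof meet.
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Return the minimum FLOP/byte needed to avoid being bandwidth bound."""
    return peak_flops / peak_bandwidth


PEAK_FLOPS = 14.2e12    # FLOP/s, figure quoted in the post
PEAK_DRAM_BW = 616e9    # bytes/s, figure quoted in the post

ridge = ridge_point(PEAK_FLOPS, PEAK_DRAM_BW)
print(f"DRAM ridge point = {ridge:.2f} FLOP/B")
```

A kernel whose DRAM arithmetic intensity falls below this value sits under the sloped (bandwidth) part of the roofline; above it, the flat compute roof is the limit.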

From NCU’s Memory Workload Analysis Section,
Bytes fetched from Device Memory = 131328B
Number of FLOPs (based on input) = 128x128x128 = 2097152
DRAM data reuse or observed op/B = #operations/bytes fetched
= 2097152/131328
=~ 16 FLOPs/B

The above calculations show that the workload should be memory bound, in contradiction to what NCU’s roofline plot shows. Also, the peak work in NCU is a smaller number (I read that this is because of the difference between the boost clock and the base clock), and the peak BW is larger than the peak BW reported in the whitepaper.

Can somebody explain why this GEMM is not memory bound on RTX2080Ti even though it theoretically should be memory bound?

Thanks!

jmarusarz · March 30, 2023, 8:50pm

Thanks for reaching out. It’s hard to say exactly where the discrepancy is coming from. Would you be able to share the Nsight Compute report that this screenshot comes from so we can dig deeper on our end?

tanvishr1197 · March 31, 2023, 1:50pm

Thanks for replying. Actually this workload is fine and is supposed to be compute bound. There was a mistake in my calculation as I did not include the addition operations when calculating theoretical op/byte. However, there is another workload which should be memory bound and its reported arithmetic intensity is much higher in NCU.
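
For the first workload (m=n=k=128), the effect of including the additions can be checked numerically. The sketch below assumes the usual GEMM convention of k multiplies and k additions per output element (2*m*n*k FLOPs in total); the byte count and peak figures are the ones from the earlier post, and `gemm_flops` is a hypothetical helper name.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C = A*B, counting each multiply and each add separately."""
    return 2 * m * n * k


DRAM_BYTES = 131328            # bytes fetched from device memory (from the post)
RIDGE = 14.2e12 / 616e9        # ~23 FLOP/B, from the figures in the post

flops_mul_only = 128 * 128 * 128          # the original count, multiplies only
flops_mul_add = gemm_flops(128, 128, 128)  # multiplies and additions

ai_mul_only = flops_mul_only / DRAM_BYTES
ai_mul_add = flops_mul_add / DRAM_BYTES

print(f"multiplies only : {ai_mul_only:.2f} FLOP/B")
print(f"with additions  : {ai_mul_add:.2f} FLOP/B")
print(f"ridge point     : {RIDGE:.2f} FLOP/B")
```

With the additions counted, the intensity is roughly 32 FLOP/B, which is above the ridge point of about 23 FLOP/B, consistent with the kernel being compute bound at DRAM.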

Kernel name: volta_sgemm_32x128_sliced1x4nn (2, 25, 1)x(256,1,1)
Input matrix size: m=3136 n=64 k=64
Machine name: RTX 2080Ti

Single Precision min Data Reuse at DRAM for workload to be not memory bound (taking numbers from whitepaper):
Ridge point AI = 13.45*1024/616=22.35

Theoretical min data reuse for DRAM:
Theoretical AI = 2*3136*64*64 / (4*(3136*64 + 64*64 + 3136*64)) = 15.8
(the factor of 4 is because single precision uses 4 bytes per element)
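
The theoretical figure above can be written as a small function. This is a sketch that assumes each of A (m x k) and B (k x n) is read from DRAM exactly once and C (m x n) is written exactly once; `gemm_theoretical_ai` is an illustrative name, not an Nsight Compute metric.

```python
def gemm_theoretical_ai(m: int, n: int, k: int, bytes_per_element: int = 4) -> float:
    """Ideal DRAM arithmetic intensity of C = A*B in FLOP/byte."""
    flops = 2 * m * n * k                  # k multiplies + k adds per output
    elements = m * k + k * n + m * n       # read A, read B, write C (once each)
    return flops / (bytes_per_element * elements)


ai_theoretical = gemm_theoretical_ai(3136, 64, 64)
ridge_whitepaper = 13.45 * 1024 / 616      # ridge point as computed in the post

print(f"theoretical AI = {ai_theoretical:.2f} FLOP/B")
print(f"ridge point    = {ridge_whitepaper:.2f} FLOP/B")
```

Under these assumptions the theoretical intensity is below the ridge point, which is why the workload is expected to be memory bound at DRAM.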

But the observed AI from the hierarchical roofline plot is = 32.24

I think the high number in Nsight Compute is because there are 0 stores to device memory. If I use the total bytes from device memory under MemoryWorkloadAnalysis, I get a number similar to the observed AI. This happens because Nsight Compute only tracks the kernel, and the output is copied using the cudaMemcpy API after the kernel ends. But even copying it from device to host memory, it would go through the device memory (Please confirm this). In that case, is Nsight Compute’s reported AI the correct value, since it does not include the store bytes to device memory?

You can find the file for the given run here:

jmarusarz · May 18, 2023, 8:36pm

Based on the data you shared, it looks like the resulting matrix is small enough to fit in L2 and is never written back to device memory, because the default cache policy for L2 is write-back. Therefore, the achieved AI is higher than the theoretical AI, because the theoretical calculation assumes the result will be stored back to device memory. You can see in the report that the stores go to L2 but not to device memory.
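
As a rough check of this explanation, the DRAM arithmetic intensity can be recomputed with the store traffic for C left out. The sketch assumes A and B are each loaded from DRAM exactly once and that no bytes of C reach DRAM; the variable names are illustrative.

```python
M, N, K = 3136, 64, 64
BYTES_PER_ELEMENT = 4                      # single precision

flops = 2 * M * N * K                      # multiplies and additions

load_bytes = BYTES_PER_ELEMENT * (M * K + K * N)   # A and B read once
store_bytes = BYTES_PER_ELEMENT * (M * N)          # C written once

ai_with_stores = flops / (load_bytes + store_bytes)
ai_loads_only = flops / load_bytes

print(f"AI with C stored to DRAM : {ai_with_stores:.2f} FLOP/B")
print(f"AI with C kept in L2     : {ai_loads_only:.2f} FLOP/B")
```

The loads-only value comes out at about 31.4 FLOP/B, close to the 32.24 read off the roofline plot, whereas including the stores gives the 15.8 from the theoretical calculation.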

“But even copying it from device to host memory, it would go through the device memory (Please confirm this).”

The copied data is already in L2 in this case and is sourced from there. It does not have to access device memory for the copy if all the data is already in the L2. For your given problem size, everything is only ever stored in L2. If you increase the problem size, you should start seeing DRAM stores once the result matrix is larger than L2.
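
A back-of-the-envelope capacity check illustrates the point. The L2 size used here (5.5 MiB for the RTX 2080 Ti) is an assumption not stated in this thread, and the check ignores that A and B also occupy L2, so the real threshold for seeing DRAM stores is likely lower than this estimate.

```python
ASSUMED_L2_BYTES = int(5.5 * 1024 * 1024)  # assumption: 5.5 MiB L2 on RTX 2080 Ti
BYTES_PER_ELEMENT = 4                      # single precision


def result_bytes(m: int, n: int) -> int:
    """Size in bytes of the m x n result matrix C."""
    return m * n * BYTES_PER_ELEMENT


def result_fits_in_l2(m: int, n: int, l2_bytes: int = ASSUMED_L2_BYTES) -> bool:
    """Capacity-only check; ignores A, B, and cache replacement effects."""
    return result_bytes(m, n) <= l2_bytes


c_bytes = result_bytes(3136, 64)
print(f"C is {c_bytes} bytes ({c_bytes / 1024:.0f} KiB)")
print("fits in assumed L2:", result_fits_in_l2(3136, 64))

# Largest m (with n = 64) whose result alone still fits in the assumed L2.
max_m = ASSUMED_L2_BYTES // (64 * BYTES_PER_ELEMENT)
print("largest m with n=64 that still fits:", max_m)
```

For m=3136, n=64 the result is 784 KiB, well under the assumed L2 capacity, which is consistent with no stores reaching device memory during the kernel.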
