Calculation of Memory Bound nature vs Roofline numbers

Hi,

I ran the following using the cublasSgemm() function from the cuBLAS library, with no shared memory optimization of my own:

  • Kernel name: volta_sgemm_32x32_sliced1x4nn
  • Input matrix size: m=n=k=128
  • Machine name: RTX 2080Ti

The hierarchical roofline that NCU shows for this kernel indicates that it is not memory bound at any level of the memory hierarchy. However, if I do a rough calculation for, say, DRAM (since the peak BW is available online):

Min data reuse or op/B required for workload to be not memory bound
= Peak FLOPs/Peak BW
= 14.2TFLOPS/616GB/s
=~ 23 FLOPs/B

From NCU’s Memory Workload Analysis Section,
Bytes fetched from Device Memory = 131328B
Number of FLOPs (based on input) = 128x128x128 = 2097152
DRAM data reuse or observed op/B = #operations/bytes fetched
= 2097152/131328
=~ 16 FLOPs/B
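As a rough cross-check, the arithmetic above can be reproduced in a few lines of Python. This is only a sketch using the figures quoted in this post (whitepaper peak values and the NCU byte count); the variable names are made up for illustration. It also shows the intensity obtained when each multiply-add is counted as two FLOPs, which is the usual convention for GEMM (2*m*n*k).

```python
# Roofline check for the m=n=k=128 SGEMM, using the figures quoted in the post.
PEAK_FLOPS = 14.2e12    # FLOP/s, whitepaper peak quoted in the post
PEAK_DRAM_BW = 616e9    # B/s, whitepaper DRAM bandwidth quoted in the post
DRAM_BYTES = 131328     # bytes fetched from device memory, as reported by NCU

m = n = k = 128

# Minimum FLOP/B needed to leave the memory-bound region of the roofline.
ridge_point = PEAK_FLOPS / PEAK_DRAM_BW

# FLOP count used in the post: one multiply per inner-product term.
flops_multiplies_only = m * n * k
# Conventional GEMM count: one multiply plus one add per inner-product term.
flops_multiply_adds = 2 * m * n * k

ai_multiplies_only = flops_multiplies_only / DRAM_BYTES
ai_multiply_adds = flops_multiply_adds / DRAM_BYTES

print(f"ridge point:            {ridge_point:.2f} FLOP/B")
print(f"AI (multiplies only):   {ai_multiplies_only:.2f} FLOP/B")
print(f"AI (multiplies + adds): {ai_multiply_adds:.2f} FLOP/B")
```

With the multiplies-only count the intensity falls below the ridge point; with the 2*m*n*k count it lands above it.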

The above calculations suggest that the workload should be memory bound, which contradicts what NCU’s roofline plot shows. Also, the peak work is lower in NCU (I read that this is because of the difference between the boost clock and the base clock), and the peak BW is higher than the peak BW reported in the whitepaper.

Can somebody explain why this GEMM is not memory bound on the RTX 2080Ti even though it theoretically should be?

Thanks!

Thanks for reaching out. It’s hard to say exactly where the discrepancy is coming from. Would you be able to share the Nsight Compute report that this screenshot comes from so we can dig deeper on our end?

Thanks for replying. Actually this workload is fine and is supposed to be compute bound. There was a mistake in my calculation as I did not include the addition operations when calculating theoretical op/byte. However, there is another workload which should be memory bound and its reported arithmetic intensity is much higher in NCU.

  • Kernel name: volta_sgemm_32x128_sliced1x4nn (2, 25, 1)x(256,1,1)
  • Input matrix size: m=3136 n=64 k=64
  • Machine name: RTX 2080Ti

Single Precision min Data Reuse at DRAM for workload to be not memory bound (taking numbers from whitepaper):
Ridge point AI = 13.45*1024/616=22.35

Theoretical min data reuse for DRAM:
Theoretical AI = (2 * 3136 * 64 * 64) / (4 * (3136 * 64 + 64 * 64 + 3136 * 64)) = 15.8
(the factor of 4 is because single precision uses 4 bytes per element)

But the observed AI from the hierarchical roofline plot is = 32.24

I think the high number in Nsight Compute is due to there being 0 stores to device memory. If I use the total bytes from device memory under Memory Workload Analysis, I get a number similar to the observed AI. This happens because Nsight Compute only tracks the kernel, and the output is copied using the cudaMemcpy API after the kernel ends. But even when copying from device to host memory, the data would go through device memory (please confirm this). In that case, is the AI reported by Nsight Compute the correct value, given that it does not include the bytes stored to device memory?
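To make the effect of the missing stores concrete, here is a small Python sketch of the theoretical intensity for this problem size, once with the output matrix counted as DRAM traffic and once without. It assumes each of the three matrices crosses the DRAM boundary exactly once, which is the same idealization as the hand calculation; the variable names are illustrative.

```python
# Theoretical DRAM arithmetic intensity for SGEMM with m=3136, n=64, k=64.
BYTES_PER_FLOAT = 4
m, n, k = 3136, 64, 64

flops = 2 * m * n * k                            # multiply + add per term

read_bytes = BYTES_PER_FLOAT * (m * k + k * n)   # inputs A (m x k) and B (k x n)
store_bytes = BYTES_PER_FLOAT * (m * n)          # output C (m x n)

# Idealized: every matrix moves to or from DRAM exactly once.
ai_with_stores = flops / (read_bytes + store_bytes)
# Same, but assuming no output bytes reach DRAM during the kernel.
ai_reads_only = flops / read_bytes

print(f"AI with stores:    {ai_with_stores:.2f} FLOP/B")
print(f"AI without stores: {ai_reads_only:.2f} FLOP/B")
```

The reads-only figure comes out close to the 32.24 observed in the roofline plot, which is consistent with the store traffic not being counted at the DRAM level.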

You can find the file for the given run here:

Based on the data you shared, it looks like the resulting matrix is small enough to fit in L2, and is never written back to device memory because the default cache policy for L2 is writeback. Therefore, the achieved AI is higher than the theoretical, because the theoretical assumes the data will be stored back to device memory. You can see in the report, the stores go to L2, but not to device memory.

"But even copying it from device to host memory, it would go through the device memory (Please confirm this)."

The copied data is already in L2 in this case and is sourced from there. It does not have to access device memory for the copy if all the data is already in the L2. For your given problem size, everything is only ever stored in L2. If you increase the problem size, you should start seeing DRAM stores once the result matrix is larger than L2.
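A quick way to estimate where that threshold lies is to compare the size of the result matrix with the L2 capacity. The Python sketch below assumes an L2 size of 5.5 MB for the RTX 2080 Ti (check the value for your card, for example via the l2CacheSize field of cudaGetDeviceProperties), and the helper names are hypothetical. It is only a rough bound: the input matrices also occupy L2, so stores to DRAM may appear earlier than this estimate.

```python
# Rough estimate of when the SGEMM result matrix stops fitting in L2.
L2_BYTES = int(5.5 * 1024 * 1024)  # ASSUMPTION: 5.5 MB L2 on the RTX 2080 Ti
BYTES_PER_FLOAT = 4


def result_bytes(m: int, n: int) -> int:
    """Size in bytes of the m x n single-precision result matrix."""
    return m * n * BYTES_PER_FLOAT


def min_rows_exceeding_l2(n: int, l2_bytes: int = L2_BYTES) -> int:
    """Smallest m for which an m x n float matrix is larger than L2."""
    return l2_bytes // (n * BYTES_PER_FLOAT) + 1


current = result_bytes(3136, 64)
print(f"result matrix for m=3136, n=64: {current} bytes "
      f"({current / L2_BYTES:.1%} of the assumed L2)")
print(f"with n=64, the result exceeds the assumed L2 from m = "
      f"{min_rows_exceeding_l2(64)}")
```

For the problem size in this thread the result matrix takes up only a small fraction of the assumed L2, which matches the absence of DRAM stores in the report.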
