However, as you can see from the figure, there are 6.72 million shared-memory access requests, which are separate from the global accesses.
So is shared-memory access actually part of the global access count? If not, how do we explain that L1's achieved bandwidth is 1 TB/s?
The L1 bandwidth is 128 B/cycle on almost all chips. I agree that an L1 achieved rate of 2.4 TFLOP/s ÷ 2.34 FLOP/byte is ~1 TB/s; however, this is the achieved bandwidth, not the theoretical bandwidth. The memory diagram does not show the L1 bandwidth, which would be the two interfaces on the left side of the L1/TEX Cache.
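The roofline relation used here can be checked with simple arithmetic: achieved bandwidth = achieved FLOP rate ÷ arithmetic intensity. A quick sketch with the numbers from this thread:

```python
# Roofline back-of-the-envelope: achieved bandwidth from the achieved
# compute rate and the arithmetic intensity quoted in the thread.

achieved_flops = 2.4e12        # 2.4 TFLOP/s achieved compute rate
arithmetic_intensity = 2.34    # FLOP per byte at the L1 level

achieved_bw = achieved_flops / arithmetic_intensity  # bytes/s
print(f"{achieved_bw / 1e12:.2f} TB/s")  # ~1.03 TB/s
```

This matches the ~1 TB/s figure, but it is the achieved value, not the L1 peak.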
The Memory Table (L1) shows you are at 11.8% of Peak, which indicates a much higher peak bandwidth. The challenge is that %Peak is measured in L1 wavefronts, which is not the same as bandwidth back to the register file.
Sorry, maybe I didn't state my question clearly. From the last entry in the memory table, you can see that the total global memory traffic is about 1 GB. The kernel's execution time is 1 ms, so the L1 achieved bandwidth is 1 TB/s, which is consistent with the roofline result. But as the last diagram shows, there are 6.72 million shared-memory requests, so the total data moved through L1 should be the global traffic plus the shared-memory traffic, i.e. more than 1 GB. The achieved bandwidth should therefore be greater than 1 TB/s, and if that's the case, it's not consistent with the roofline.
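The arithmetic behind this question can be sketched as follows. The 128 B per shared-memory request is an assumption for illustration only (it is not stated in the thread):

```python
# If L1 traffic also included the shared-memory requests, the achieved
# bandwidth would come out above 1 TB/s. The per-request size of 128 B
# is an assumed value, used only to illustrate the discrepancy.

global_bytes = 1e9             # ~1 GB of global traffic from the memory table
elapsed_s = 1e-3               # 1 ms kernel execution time

bw_global_only = global_bytes / elapsed_s
print(f"global only: {bw_global_only / 1e12:.2f} TB/s")  # 1.00 TB/s

shared_requests = 6.72e6       # shared-memory requests from the diagram
bytes_per_request = 128        # assumed request size
shared_bytes = shared_requests * bytes_per_request  # ~0.86 GB

bw_with_shared = (global_bytes + shared_bytes) / elapsed_s
print(f"with shared: {bw_with_shared / 1e12:.2f} TB/s")  # ~1.86 TB/s
```

Under that assumption the total would be well above the ~1 TB/s the roofline reports, which is the apparent contradiction being asked about.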
The NCU install has a "sections" directory that contains the section files. For L1, "bytes" is the return bandwidth of the LSU. Which bytes are counted in the measurement differs between chips. For GH100 and GB100, note that the L1 figure does not include shared memory read by the Tensor Cores, because that traffic does not increment the LSU writeback to the register file.