L1 bandwidth benchmark

I tried to measure the L1 cache bandwidth with an inline-PTX benchmark. I launched the kernel below with a single thread on Xavier. The kernel scans an array of 64 elements of 8 bytes each, loading every element into register r1, and repeats the scan 1024 times, so 8 x 64 x 1024 bytes (512 KiB) are loaded in total. From the measured running time, I estimated the maximum L1 cache bandwidth at about 60 GB/s per SM (assuming every SM can run 2048 threads). This is much lower than I expected. Can anyone explain it? Is my benchmark wrong?
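For readers following the arithmetic, here is a minimal host-side C++ sketch of how a per-SM figure can be derived from a single-thread timing. It is not the original kernel. The function names are hypothetical, and the extrapolation simply assumes that all 2048 resident threads sustain the same rate as the one measured thread, which is the assumption stated in the question.

```cpp
#include <cstdint>

// Total bytes loaded by one thread: element size * array length * passes.
constexpr std::uint64_t bytes_loaded(std::uint64_t elem_bytes,
                                     std::uint64_t elems,
                                     std::uint64_t passes) {
    return elem_bytes * elems * passes;
}

// Bandwidth of the single measured thread in GB/s (1 GB = 1e9 bytes).
constexpr double thread_gbps(std::uint64_t bytes, double seconds) {
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Extrapolation to one SM. ASSUMPTION (from the question): every one of
// the threads_per_sm resident threads achieves the single-thread rate.
constexpr double sm_gbps(double per_thread_gbps, unsigned threads_per_sm) {
    return per_thread_gbps * threads_per_sm;
}
```

For the numbers in the question, `bytes_loaded(8, 64, 1024)` gives 524288 bytes; the elapsed time to plug into `thread_gbps` is whatever the timer reported.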

Here is my code (originally posted as an image):

What were you expecting? Back in the day, I used to expect at least 1 TB/s aggregate shared/L1 bandwidth across an entire device. At your measured 60 GB/s per SM, that would take about 16 to 17 SMs. That seems plausible.
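The SM count above is just a division. A small C++ sketch, with a hypothetical helper name, using only the two figures quoted in this thread:

```cpp
// Number of SMs needed to reach an aggregate target bandwidth, given a
// per-SM bandwidth. Both inputs are in GB/s; the result is fractional.
constexpr double sms_for_target(double target_gbps, double per_sm_gbps) {
    return target_gbps / per_sm_gbps;
}
```

With a 1000 GB/s target and 60 GB/s per SM, `sms_for_target(1000.0, 60.0)` comes out between 16 and 17.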

Xavier is a Volta architecture device. According to these measurements on V100, the L1 cache supported ~100 bytes per cycle of load throughput (*). If we multiply that by the Xavier base clock of ~850 MHz, we get ~85 GB/s per SM. Your 60 GB/s per SM number seems to be "in the ballpark". I also don't know for sure that Xavier (cc7.2) exactly duplicates the V100 (cc7.0) SM design. (In fact, I know there are differences.)
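The estimate above is bytes per cycle times clock rate. As a rough C++ sketch (helper names are hypothetical, and the ~100 bytes/cycle and ~850 MHz inputs are the approximate figures quoted in this reply, not exact Xavier specifications):

```cpp
// Peak per-SM load bandwidth in GB/s from L1 throughput (bytes/cycle)
// and SM clock (MHz). 1 GB = 1e9 bytes.
constexpr double peak_gbps(double bytes_per_cycle, double clock_mhz) {
    return bytes_per_cycle * clock_mhz * 1e6 / 1e9;
}

// Fraction of the estimated peak that a measurement achieves.
constexpr double fraction_of_peak(double measured_gbps, double peak) {
    return measured_gbps / peak;
}
```

`peak_gbps(100.0, 850.0)` gives about 85 GB/s, so a 60 GB/s measurement is roughly 70% of that estimate.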

You might want to study that benchmarking paper to see how it compares with your approach.

I also generally encourage people not to post pictures of code. Post it as text, with proper formatting (e.g. select your code, then use the </> button).