I tried to measure L1 cache bandwidth with an inline-PTX benchmark. I launched the kernel below with a single thread on Xavier. The kernel scans an array of 64 elements, each 8 bytes, and loads every element into register r1. The thread scans the array 1024 times, so 8 × 64 × 1024 = 524,288 bytes (512 KiB) are loaded in total. I measured the running time and found that the maximum L1 cache bandwidth comes out to about 60 GB/s per SM (assuming every SM can run 2048 threads). This is much lower than I expected. Can anyone explain it? Is my benchmark wrong?
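To make the arithmetic behind that figure explicit, here is a minimal host-side sketch of one plausible way to derive it. The function names and the `elapsed_seconds` parameter are hypothetical (they are not part of the benchmark), and the linear scaling from one thread to 2048 threads is an assumption, not something the measurement shows.

```cpp
#include <cstdint>

// Sizes taken from the description above.
constexpr std::uint64_t kElementBytes = 8;     // bytes per element
constexpr std::uint64_t kElements     = 64;    // elements in the array
constexpr std::uint64_t kPasses       = 1024;  // scans of the whole array

// Total bytes loaded by the single thread: 8 * 64 * 1024 = 524288.
constexpr std::uint64_t total_bytes() {
    return kElementBytes * kElements * kPasses;
}

// Bandwidth seen by the one measured thread, in GB/s (1 GB = 1e9 bytes).
// elapsed_seconds is a hypothetical input: the measured running time.
double single_thread_gbps(double elapsed_seconds) {
    return static_cast<double>(total_bytes()) / elapsed_seconds / 1e9;
}

// Per-SM estimate. ASSUMPTION: bandwidth scales linearly with the number
// of resident threads (2048 per SM, as stated in the question).
double per_sm_gbps(double elapsed_seconds, int threads_per_sm = 2048) {
    return single_thread_gbps(elapsed_seconds) * threads_per_sm;
}
```

Calling `per_sm_gbps(t)` with the measured time `t` in seconds gives the per-SM estimate under that scaling assumption.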
Here is my code: