L1 cache behavior not as described in the documentation

I used this kernel to test cache behavior.
kernel.cu (6.3 KB)
I want to understand CUDA cache behavior with a stride of 128 floats per thread, chosen so that threads in a warp never hit in the L1 or L2 cache.
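Roughly, the per-thread access pattern looks like the sketch below (a simplified illustration only; the kernel name, bounds, and indexing here are placeholders, the real code is in the attached kernel.cu):

```cuda
// Simplified sketch of the described access pattern (placeholder names and
// bounds; see the attached kernel.cu for the real kernel): each thread owns
// a disjoint 128-float (512 B) slice, so threads in a warp never share a
// 128 B cache line and there is no inter-thread reuse in L1 or L2.
__global__ void strided_touch(const float* __restrict__ data,
                              float* __restrict__ out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * 128;            // 128-float stride between threads
    float acc = 0.0f;
    for (int j = 0; j < 64; ++j)     // per-thread walk, 4 B per step
        acc += data[base + j];
    out[tid] = acc;
}
```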
I have tested L1 and L2 cache behavior with compiler_opt: compute_86,sm_86 on an "NVIDIA GeForce RTX 3050 Ti Laptop GPU".
The L2 cache behaves as the compute capability documentation says,
while the L1 cache confuses me.

Specifically, I'm confused about the L1 cache behavior of the test_cacheL1 kernel.
Here are the test results:
j, 56 - j: 10.266317, load 57, numRegisters: 40
j, 57 - j: 11.777024, load 58, numRegisters: 40
j, 58 - j: 11.803853, load 59, numRegisters: 38
j, 59 - j: 11.906048, load 60, numRegisters: 40
j, 60 - j: 12.126719, load 61, numRegisters: 40
j, 61 - j: 12.183142, load 62, numRegisters: 40
j, 62 - j: 11.204202, load 63, numRegisters: 38
j, 63 - j: 12.265574, load 64, numRegisters: 40
Why does load 58 cost so much more time than load 57?
Why does load 63 cost less time than load 61?

What is your expectation for L1 that is not met?

Unless I'm misreading the code, you will hit in the L1 cache on a little more than 7/8 of accesses. The for loop increments j, which advances the address by 4 B. There are 32 B in a sector, so you get 1 miss and 7 hits. For values other than 63 - j, the 1st and 2nd accesses will overlap near the end of the loop, resulting in a slightly higher hit rate.
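Concretely: a 32 B sector holds 32 B / 4 B = 8 floats, so a sequential 4 B-stride walk misses on the first float of each sector and hits on the remaining 7, i.e. an expected L1 hit rate of 7/8 = 87.5%, plus a little extra wherever the two access streams overlap.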

I'm not clear why the register count changes; that would require diffing the SASS. I would recommend passing the offset of the 2nd access as a kernel parameter to avoid compiler differences.
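Something like the sketch below (hypothetical; the loop bound and indexing are assumptions, adapt it to your actual kernel). The point is that the offset becomes a runtime argument, so every measurement runs the identical SASS with the same register count:

```cuda
// Hypothetical sketch of the suggestion: make the 2nd-access offset a
// runtime kernel argument instead of a compile-time constant.
__global__ void test_cacheL1(const float* __restrict__ data,
                             float* __restrict__ out,
                             int offset)                  // 56, 57, ..., 63
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * 128;                 // 128-float stride per thread
    float acc = 0.0f;
    for (int j = 0; j <= offset; ++j) {   // loop bound is an assumption
        acc += data[base + j];            // 1st access, +4 B per iteration
        acc += data[base + (offset - j)]; // 2nd access at offset - j
    }
    out[tid] = acc;
}
```

Every run then launches the same compiled kernel, e.g. `test_cacheL1<<<grid, block>>>(d_in, d_out, 57);`, and register-count differences drop out of the comparison.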

Why does load 58 cost so much more time than load 57?
(11.906048 - 11.803853) / 11.803853 = 0.008657766

This is a negligible difference given (a) use of CUDA events for timing, and (b) duration of the kernel.
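If you want to reduce that timing noise, one option (sketch only; test_cacheL1 and its arguments are stand-ins for your actual launch) is to time many back-to-back launches and report the mean:

```cuda
#include <cuda_runtime.h>

// Sketch: average the elapsed time over many launches so CUDA event
// resolution and launch overhead are amortized across repetitions.
float time_kernel_ms(int offset, int reps, const float* d_in, float* d_out,
                     dim3 grid, dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        test_cacheL1<<<grid, block>>>(d_in, d_out, offset);  // stand-in launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;   // mean milliseconds per launch
}
```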

Why does load 63 cost less time than load 61?

From your numbers, 63 costs more time than 61. 63 has fewer hits in L1, as the 1st and 2nd load addresses never overlap.

For this kernel, I used NVIDIA Nsight Compute to find the kernel's bottleneck.
For both j, 58 - j and j, 50 - j, the bottleneck is always bandwidth.

(11.906048 - 11.803853) / 11.803853 = 0.008657766

I mean comparing load 57 (56 - j) with load 58 (58 - j), not a comparison by index.
The increase is about 1.5 / 10 = 0.15 ms (I have tested 10 times), which is obviously a lot of cycles compared to the GPU max clock rate (1485 MHz (1.49 GHz)) and the memory clock rate (6001 MHz).
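For scale: 0.15 ms at the 1485 MHz core clock is roughly 0.15 × 10⁻³ s × 1.485 × 10⁹ cycles/s ≈ 223,000 core cycles.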

See the test results:
j, 56 - j: 10.266317, load 57, numRegisters: 40
j, 57 - j: 11.777024, load 58, numRegisters: 40 (does not make sense compared to j, 56 - j: 10.266317, load 57)
j, 58 - j: 11.803853, load 59, numRegisters: 38
j, 59 - j: 11.906048, load 60, numRegisters: 40
j, 60 - j: 12.126719, load 61, numRegisters: 40
j, 61 - j: 12.183142, load 62, numRegisters: 40
j, 62 - j: 11.204202, load 63, numRegisters: 38 (does not make sense compared to j, 61 - j: 12.183142, load 62)
j, 63 - j: 12.265574, load 64, numRegisters: 40

Apologies for my typing error.
I meant: why does load 62 cost less time than load 61?

Can anyone help with this?