Turing L2 cache

The deviceQuery output for the 2080Ti says

(68) Multiprocessors, ( 64) CUDA Cores/MP:     4352 CUDA Cores
L2 Cache Size:                                 5767168 bytes

Since the L2 is shared among all SMs, dividing it evenly gives 5767168/68 ≈ 84811.2941 bytes per SM, which is not a power of 2. Usually the number of sets, the number of ways, and the block size are powers of 2. For that size, we can estimate (S=41)(W=16)(B=128), which yields 83,968 bytes, or (S=44)(W=15)(B=128), which yields 84,480 bytes.
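To double-check the arithmetic, here is a small sketch in Python. The helper names `cache_bytes` and `is_power_of_two` are made up for illustration; the only inputs are the deviceQuery figures quoted above, and the per-SM split is the assumption being tested, not a known property of the hardware.

```python
# Figures reported by deviceQuery for the 2080Ti (quoted in the post).
L2_BYTES = 5767168
NUM_SMS = 68


def cache_bytes(sets, ways, block_bytes):
    """Capacity of a set-associative cache: sets * ways * block size."""
    return sets * ways * block_bytes


def is_power_of_two(n):
    """True for positive integers with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# Assumption under test: the L2 is divided evenly per SM.
per_sm = L2_BYTES / NUM_SMS
print(f"L2 bytes per SM: {per_sm:.4f}")
print("divides evenly across SMs:", L2_BYTES % NUM_SMS == 0)

# The two candidate geometries from the post.
for sets, ways, block in [(41, 16, 128), (44, 15, 128)]:
    size = cache_bytes(sets, ways, block)
    print(
        f"S={sets} W={ways} B={block}: {size} bytes per SM, "
        f"sets power of 2: {is_power_of_two(sets)}, "
        f"ways power of 2: {is_power_of_two(ways)}"
    )
```

Neither candidate reproduces the reported size exactly, and each needs a non-power-of-2 set or way count, which suggests that a per-SM split may be the wrong way to slice this number.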

Apart from microbenchmarking, I am curious to know whether there is any more information about this.

Here are some hints:

  1. https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

  2. 5767168 bytes = 5632KB, and 5632KB / 512KB = 11

  3. The TU102 die has 12 memory controllers (each corresponding to 32 bits of bus width), but the 2080Ti only has a 352-bit memory bus, i.e. 352 / 32 = 11 controllers' worth.
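Putting the hints together, a rough Python sketch of the arithmetic follows. It assumes (this is not confirmed by deviceQuery itself) that the L2 is split evenly across the enabled 32-bit memory controllers rather than across SMs; the variable names are illustrative only.

```python
L2_BYTES = 5767168          # L2 size reported by deviceQuery
BUS_WIDTH_BITS = 352        # 2080Ti memory bus width (hint 3)
BITS_PER_CONTROLLER = 32    # bus width per memory controller (hint 3)

# Number of enabled memory controllers implied by the bus width.
controllers = BUS_WIDTH_BITS // BITS_PER_CONTROLLER

# Assumption: one equally sized L2 slice per enabled controller.
slice_bytes, remainder = divmod(L2_BYTES, controllers)

# A positive integer is a power of two if exactly one bit is set.
slice_is_pow2 = slice_bytes > 0 and (slice_bytes & (slice_bytes - 1)) == 0

print("enabled controllers:", controllers)
print("L2 slice per controller:", slice_bytes // 1024, "KB")
print("divides evenly:", remainder == 0)
print("slice is a power of 2:", slice_is_pow2)
```

Under that assumption the per-slice size comes out as a clean power of 2, so sets, ways, and block size can all be powers of 2 within each slice; the exact geometry would still need microbenchmarking to pin down.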