Volta bandwidth to SM

The Hot Chips presentation shows the L1/shared memory bandwidth to each of the 4 blocks in an SM as 64 bytes/cycle.
With 4 blocks, is that 64*4 = 256 bytes/clock?
It also shows the L2-to-SM bandwidth as 128 bytes/clock.
Has anyone with a V100 had a chance to measure this?
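
For reference, here is the arithmetic from the question as a small Python sketch. The per-block and L2 figures are the ones quoted from the slides above; the clock value is a placeholder I made up for the unit conversion, not a measured or official number, so substitute whatever your card actually runs at.

```python
# Figures as quoted from the Hot Chips slides in the post above.
BLOCKS_PER_SM = 4
L1_BYTES_PER_CLK_PER_BLOCK = 64
L2_BYTES_PER_CLK_PER_SM = 128

# Placeholder clock for illustration only (an assumption, not a spec value).
ASSUMED_CLOCK_MHZ = 1500


def to_gb_per_s(bytes_per_clk, clock_mhz):
    """Convert a bytes/clock figure to GB/s (10^9 bytes) at the given clock."""
    return bytes_per_clk * clock_mhz * 1e6 / 1e9


# Cumulative L1/shared bandwidth per SM if all 4 blocks run at full rate.
l1_bytes_per_clk_per_sm = BLOCKS_PER_SM * L1_BYTES_PER_CLK_PER_BLOCK

print("L1/shared per SM:", l1_bytes_per_clk_per_sm, "bytes/clock")
print("L1/shared per SM:", to_gb_per_s(l1_bytes_per_clk_per_sm, ASSUMED_CLOCK_MHZ), "GB/s")
print("L2 to SM:", to_gb_per_s(L2_BYTES_PER_CLK_PER_SM, ASSUMED_CLOCK_MHZ), "GB/s")
```

This only restates the peak numbers; whether the hardware sustains them is exactly what a measurement would have to show.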

The first one is quite obvious: with 16 LD/ST engines, each dispatcher can perform 16 loads or stores per cycle.

I would be surprised if anyone on the open market already has a GV100.

It's actually 8 LD/ST engines per partition on a V100. What is not clear is whether it can sustain the cumulative bandwidth of 256 bytes/clock.
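
To make the open question concrete, here is a rough Python sketch of what each LD/ST engine would have to move per clock to reach the quoted 64 bytes/cycle per partition. The 4-byte access width is my assumption (one 32-bit word per engine per clock), not a documented figure, and the function names are just for illustration.

```python
PARTITIONS_PER_SM = 4
TARGET_BYTES_PER_CLK_PER_PARTITION = 64  # figure quoted from the slides


def bytes_per_engine_needed(target_bytes_per_clk, engines):
    """Bytes each LD/ST engine must move per clock to hit the target."""
    return target_bytes_per_clk / engines


def sm_bytes_per_clk(engines_per_partition, bytes_per_engine):
    """Peak bytes/clock per SM if every engine moves bytes_per_engine each clock."""
    return PARTITIONS_PER_SM * engines_per_partition * bytes_per_engine


for engines in (16, 8):
    needed = bytes_per_engine_needed(TARGET_BYTES_PER_CLK_PER_PARTITION, engines)
    print(engines, "engines ->", needed, "bytes per engine per clock")

# Assumption: 4-byte accesses, 8 engines per partition.
print("SM total at 4 bytes/engine:", sm_bytes_per_clk(8, 4), "bytes/clock")
```

So with 8 engines per partition, 256 bytes/clock per SM would require 8 bytes per engine per clock (for example 64-bit accesses); with 4-byte accesses the same arithmetic gives 128 bytes/clock.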