Program reports higher bandwidth than theoretical maximum

I’m running some performance tests on a program I wrote, using a Tesla K20Xm, a P100 and a V100. The last one showed some strange behavior.

A quick explanation of the program: it takes a matrix, does some operations on it, and then each node is “propagated” to another matrix; the same operations are then done on this new matrix, and so on. The matrices are global and the propagation is done via

    ... do operations ...
    newMatrix[iNew] = oldMatrix[iOld]; 
    ... swap matrix pointers ...
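For context, the double-buffering scheme above can be sketched on the host side like this (the names `propagate` and `runSteps`, the use of `float`, and the identity index mapping are my illustration, not the actual code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One propagation step: each node of the source matrix is written to its
// (possibly remapped) position in the destination matrix.
void propagate(const std::vector<float>& oldM, std::vector<float>& newM) {
    for (std::size_t i = 0; i < oldM.size(); ++i) {
        // ... do operations on oldM[i] ...
        newM[i] = oldM[i];          // iNew == iOld here for brevity
    }
}

// Runs `steps` propagation steps, swapping the two buffers between steps
// so each step reads the result of the previous one.
void runSteps(std::vector<float>& a, std::vector<float>& b, int steps) {
    for (int s = 0; s < steps; ++s) {
        propagate(a, b);
        std::swap(a, b);            // ... swap matrix pointers ...
    }
}
```

In the real program the buffers live in GPU global memory and the swap is a device-pointer swap, but the access pattern is the same: every step reads one full matrix and writes one full matrix.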

For the P100 and K20Xm, when I decreased the size of the matrix (128x128x128 to 32x32x32), performance dropped by at least 15%. On the V100, the opposite occurred: performance is about 40% higher for the smaller matrix.

Also, I measured the theoretical maximum bandwidth of all the GPUs using the bandwidthTest from the CUDA samples (despite it not being recommended for performance testing, the code is really straightforward and the results are trustworthy).
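The effective bandwidth I report for the propagation kernel is computed the usual way: bytes read plus bytes written, divided by elapsed time. A minimal sketch (the function name and the one-read-one-write assumption are mine):

```cpp
#include <cstddef>

// Effective bandwidth in GB/s for a kernel that reads each element of an
// N-element matrix once and writes each element of the output once.
double effectiveBandwidthGBs(std::size_t nElements,
                             std::size_t bytesPerElement,
                             double elapsedSeconds) {
    // One full read of the old matrix plus one full write of the new one.
    double bytesMoved = 2.0 * nElements * bytesPerElement;
    return bytesMoved / elapsedSeconds / 1.0e9;
}
```

Note that this counts the bytes the kernel *requests*, not the bytes that actually cross the DRAM bus, which matters once a working set starts fitting in on-chip caches.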

Here is the question: the effective bandwidth measured with the 32x32x32 matrix is greater than the theoretical maximum from bandwidthTest, by about 30%.

The only explanation I can find is the L1 cache: the V100 has ~128 kB per SM versus 24 kB on the P100. Another possible cause is something related to compute capability (the P100 is 6.0, the V100 7.0).
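The cache explanation can be checked with simple arithmetic. Assuming double-precision elements (my assumption), a 32x32x32 matrix is 256 KiB, so the two buffers together (512 KiB) fit comfortably in the V100’s 6 MiB L2 cache, while the two 128x128x128 buffers (32 MiB) do not; in the small case most traffic never reaches DRAM, so the effective bandwidth can exceed the DRAM maximum:

```cpp
#include <cstddef>

// Working-set size in bytes for an n*n*n matrix of `bytesPerElement`-byte
// elements, counting both the old and the new buffer.
std::size_t workingSetBytes(std::size_t n, std::size_t bytesPerElement) {
    return 2 * n * n * n * bytesPerElement;
}
```

With `workingSetBytes(32, 8)` = 512 KiB against a 6 MiB L2, cache hits would inflate the measured figure exactly as observed; the same arithmetic (halved sizes) holds if the elements are single precision.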

I’d like to know whether this is common on the V100 and why it happens. If source code or numerical results are needed, just ask.