Program presents higher bandwidth than theoretical maximum

waine2000 · July 8, 2019, 5:03pm

I’m running some performance tests on a program I wrote, using the Tesla K20Xm, P100 and V100. The V100 showed some strange behavior.

A quick explanation of the program: it takes a matrix, performs some operations on it, and then each node is “propagated” to another matrix. The same operations are then applied to this new matrix, and so on. The matrices are global and the propagation is done via

while(!end){
    ... do operations ...
    newMatrix[iNew] = oldMatrix[iOld]; 
    ... swap matrix pointers ...
}

For the P100 and K20Xm, when I decreased the size of the matrix (from 128x128x128 to 32x32x32), the performance dropped by at least 15%. On the V100, the opposite occurred: the performance is about 40% higher for the smaller matrix.

I also measured a reference maximum bandwidth for all the GPUs using bandwidthTest from the CUDA samples (although it is not recommended for performance testing, the code is really straightforward and the results are trustworthy).

Now comes the question: the bandwidth achieved with the 32x32x32 matrix is about 30% higher than the theoretical maximum obtained from bandwidthTest.

The only explanations I can find are the L1 cache, which is ~128 kB on the V100 and 24 kB on the P100, or something related to the compute capability (the P100 is 6.0 and the V100 is 7.0).
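A quick way to check the cache hypothesis is to compare the working set of one pass with a cache size. The sketch below assumes one 4-byte value per node, which is not stated here, and uses hypothetical helper names. Under that assumption the two 32x32x32 matrices occupy 256 KiB and the two 128x128x128 matrices occupy 16 MiB. The 6 MiB used in the usage note is the V100 L2 size from the public specifications, brought in as an assumption.

```cpp
#include <cstddef>

// Bytes touched per pass: two matrices (old and new) of nx*ny*nz nodes.
// bytesPerNode is an assumption; the real node size is not given in the post.
inline std::size_t workingSetBytes(std::size_t nx, std::size_t ny, std::size_t nz,
                                   std::size_t bytesPerNode) {
    return 2 * nx * ny * nz * bytesPerNode;
}

// True when the whole working set could, in principle, stay resident
// in a cache of the given size.
inline bool fitsInCache(std::size_t workingSet, std::size_t cacheBytes) {
    return workingSet <= cacheBytes;
}
```

For example, fitsInCache(workingSetBytes(32, 32, 32, 4), 6 * 1024 * 1024) is true, while the same check for 128x128x128 is false. This only shows whether the data could fit, not that the hardware actually keeps it cached between passes.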

I want to know whether this is common on the V100 and why it happens. If you need the source code or the numerical results, just ask.
