 # Strange FLOP counts

I used the matrix multiplication example in Chap. 6 of the Programming Guide and
got some strange FLOP counts. Instead of computing C = A * B, I modified the code
slightly to compute C = A^n * B:

```
void mul_(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    // Ad, Bd, Cd are the device copies of A, B, C and `size` is the
    // byte count of C; they are set up as in the original SDK example.
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // n can be 10 or 20
    int n = 20;
    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));
    for (int m = 1; m <= n; m++) {
        Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);
        cudaMemcpy(Bd, Cd, size, cudaMemcpyDeviceToDevice);
        printf("n = %d, time: %f (ms)\n", m, cutGetTimerValue(timer));
    }

    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));

    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
}
```

For matrix size 1024 x 1024 and n = 10, I got the following timing result on an 8800GT:
n = 1, time: 0.070000 (ms)
n = 2, time: 0.134000 (ms)
n = 3, time: 0.153000 (ms)
n = 4, time: 0.173000 (ms)
n = 5, time: 0.189000 (ms)
n = 6, time: 0.204000 (ms)
n = 7, time: 0.219000 (ms)
n = 8, time: 0.233000 (ms)
n = 9, time: 0.249000 (ms)
n = 10, time: 0.263000 (ms)
Processing time: 0.266000 (ms)

So, FLOPS = (1024)^3 * 10 / 0.000266 = 40.3 Teraflops!
Also, these times can change drastically from run to run.

When I repeated with n = 20, I got:
n = 1, time: 0.079000 (ms)
n = 2, time: 53.512001 (ms)
n = 3, time: 53.542999 (ms)
n = 4, time: 53.563999 (ms)
n = 5, time: 53.584999 (ms)
n = 6, time: 53.605000 (ms)
n = 7, time: 53.625999 (ms)
n = 8, time: 53.646000 (ms)
n = 9, time: 53.665001 (ms)
n = 10, time: 53.763000 (ms)
n = 11, time: 53.783001 (ms)
n = 12, time: 53.801998 (ms)
n = 13, time: 53.821999 (ms)
n = 14, time: 53.841999 (ms)
n = 15, time: 188.386002 (ms)
n = 16, time: 188.408005 (ms)
n = 17, time: 188.423004 (ms)
n = 18, time: 188.440002 (ms)
n = 19, time: 188.455002 (ms)
n = 20, time: 322.501007 (ms)
Processing time: 322.506989 (ms)

So, FLOPS = (1024)^3 * 20 / 0.3225 = 66.5 Gigaflops, which is more reasonable.
Note that the above cumulative time does not increase smoothly, but in steps.

In both cases, the results were verified as correct against a CPU version of the code.

Can FLOPS be measured more repeatably in CUDA?

Are you running the code a few times to smooth out the startup time and such?

Most of the examples do something like that. Haven’t explored it myself, but it seems like a rule of thumb: repeat tests for timing reliability.

The startup time, say, the time for n = 1, varies greatly from run to run and from day to day.

I think I will use the clock on my CPU, which gives more repeatable results, instead of the one in CUDA.

Thanks.

The first call to CUDA triggers driver initialization; you should never time it. Skip the first call and time a few thousand calls after that to get a good average.

Thanks for the suggestion. I am now getting more or less repeatable results close to 40 Gigaflops by running for n = 500 several times.

However, I am still puzzled by the fact that the cumulative time increases in steps. The following are sample results for n = 500, showing both the time on the CUDA clock (in ms) and the time on the CPU clock (in s):

n = 486, time: 12734.717773 (ms) or 12.860000 (s)
n = 487, time: 12734.778320 (ms) or 12.860000 (s)
n = 488, time: 12868.706055 (ms) or 12.990000 (s)
n = 489, time: 12868.728516 (ms) or 12.990000 (s)
n = 490, time: 12868.773438 (ms) or 12.990000 (s)
n = 491, time: 12868.848633 (ms) or 12.990000 (s)
n = 492, time: 12868.909180 (ms) or 12.990000 (s)
n = 493, time: 13002.848633 (ms) or 13.130000 (s)
n = 494, time: 13002.871094 (ms) or 13.130000 (s)
n = 495, time: 13002.915039 (ms) or 13.130000 (s)
n = 496, time: 13002.990234 (ms) or 13.130000 (s)
n = 497, time: 13003.048828 (ms) or 13.130000 (s)
n = 498, time: 13136.938477 (ms) or 13.260000 (s)
n = 499, time: 13136.959961 (ms) or 13.260000 (s)
n = 500, time: 13137.004883 (ms) or 13.260000 (s)

Both times increase in steps together, which means that both clocks are correct, but the FLOPS computed from time(n+1) - time(n) varies from GFlops to TFlops.

Why are you puzzled? There are 16 multiprocessors for a total of 128 processors (or maybe fewer for your card). I see lots of potential for computation time increasing in steps. Often CUDA code is actually bound by memory bandwidth, so you can do more calculations in the same amount of running time.

Also, you cannot calculate FLOPS by comparing n+1 with n.

Do you call cudaThreadSynchronize() before recording times? If not, the steps you see are from asynchronous kernel launches. The queue depth is 16 kernels, so you can run up to 16 and measure a "0" time if you don't synchronize and wait for the kernels to complete.
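Sketched against the loop from the first post (Muld, Ad, Bd, Cd, and the CUTIL timer come from the original code), the fix is one extra call so the host waits for all queued kernels before the timer is read:

```
for (int m = 1; m <= n; m++) {
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);
    cudaMemcpy(Bd, Cd, size, cudaMemcpyDeviceToDevice);
    cudaThreadSynchronize();  // block until the kernel has actually
                              // finished before reading the timer
    printf("n = %d, time: %f (ms)\n", m, cutGetTimerValue(timer));
}
```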

That’s the problem! After calling cudaThreadSynchronize(), my cumulative time now increases smoothly:

n = 1, time: 26.754999 (ms) or 0.190000 (s)
n = 2, time: 53.610001 (ms) or 0.220000 (s)
n = 3, time: 80.989998 (ms) or 0.240000 (s)
n = 4, time: 107.703003 (ms) or 0.270000 (s)
n = 5, time: 135.078003 (ms) or 0.300000 (s)
n = 6, time: 161.727997 (ms) or 0.320000 (s)
n = 7, time: 189.078003 (ms) or 0.350000 (s)
n = 8, time: 215.727997 (ms) or 0.380000 (s)
n = 9, time: 243.095001 (ms) or 0.410000 (s)
n = 10, time: 269.779999 (ms) or 0.430000 (s)

As to DenisR’s remark, I can now (after calling cudaThreadSynchronize) actually compute FLOPS from n to n+1 repeatably. For example, from t1 to t2:

FLOPS = (1024)^3 / (0.05361 - 0.02675) = 40 Gigaflops,

which is the same as what I obtained for n = 500 yesterday.

Just be careful that for some kernels the time will not increase smoothly with input size, but in steps, so the incremental count is dangerous there. I would do it differently: calculate GFLOPS for n+1 and for n, and then subtract those values :) Then you can see how big n must be to reach the maximum GFLOPS.

I’ll keep that in mind.

For the sake of completeness, let me also quote the FLOP results for my CPU, which is a single core of a 2.4 GHz quad-core processor. For a matrix size of 1024x1024, it is only 0.085 GFlops, while for a matrix size of 512x512, it is 0.43 GFlops. (In both cases, CUDA is close to 40 GFlops.)

I can tell you that an 8800GTX vs. one core of a quad-core (3.66 GHz) gives a factor-of-50 speedup when calling CUDA from MATLAB vs. simple C code called from MATLAB (both including the MATLAB-to-C calling overhead), when summing 5500+ sinusoids on an irregular grid.

Impressive.

Yeah, up to now the performance is generally mighty impressive.

Of course, when you use all four cores and SSE the speedup will suddenly drop to somewhere around 4. Still not bad.

Yep, but I have no idea how to use SSE, too difficult for me :blink:

It really isn’t that bad - this tutorial is decent: http://www.codeproject.com/KB/recipes/sseintro.aspx

I don’t use Visual C++ :D But you are right, that doesn’t look too bad.

Here are my results for different numbers of cores:

Matrix size 1024x1024:

```
No. cores    GFlops
1            0.062 (varies a bit from run to run)
2            0.093
3            0.102
4            0.103
```

Matrix size 512x512:

```
No. cores    GFlops
1            0.42
2            0.84
3            1.06
4            1.37
```

I used OpenMP to parallelize this matrix multiplication code. My experience is that a 2X speed improvement in going from 1 to 4 cores is typical for large problems.

Would SSE do much better?

I’m fairly sure the same intrinsics work with gcc. Maybe there are some minor differences…

SSE vectorization is orthogonal to threading, so it has the potential of additional speedup.

Paulius