 # Strange FLOP counts

I used the matrix multiplication example in Chap. 6 of the Programming Guide and
got some strange FLOP counts. Instead of computing C = A * B, I modified the code
slightly to compute C = A^n * B:

```
void mul_(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    // Ad, Bd, Cd are the device copies of A, B, C and `size` is the
    // byte count of C; they are set up as in the original SDK example.
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // n can be 10 or 20
    int n = 20;
    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));
    for (int m = 1; m <= n; m++) {
        Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);
        cudaMemcpy(Bd, Cd, size, cudaMemcpyDeviceToDevice);
        printf("n = %d, time: %f (ms)\n", m, cutGetTimerValue(timer));
    }

    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));

    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
}
```

For matrix size 1024 x 1024 and n = 10, I got the following timing result on an 8800GT:
n = 1, time: 0.070000 (ms)
n = 2, time: 0.134000 (ms)
n = 3, time: 0.153000 (ms)
n = 4, time: 0.173000 (ms)
n = 5, time: 0.189000 (ms)
n = 6, time: 0.204000 (ms)
n = 7, time: 0.219000 (ms)
n = 8, time: 0.233000 (ms)
n = 9, time: 0.249000 (ms)
n = 10, time: 0.263000 (ms)
Processing time: 0.266000 (ms)

So, FLOPS = (1024)^3 * 10 / 0.000266 = 40.3 Teraflops!
Also, these times can change drastically from run to run.

When I repeated with n = 20, I got:
n = 1, time: 0.079000 (ms)
n = 2, time: 53.512001 (ms)
n = 3, time: 53.542999 (ms)
n = 4, time: 53.563999 (ms)
n = 5, time: 53.584999 (ms)
n = 6, time: 53.605000 (ms)
n = 7, time: 53.625999 (ms)
n = 8, time: 53.646000 (ms)
n = 9, time: 53.665001 (ms)
n = 10, time: 53.763000 (ms)
n = 11, time: 53.783001 (ms)
n = 12, time: 53.801998 (ms)
n = 13, time: 53.821999 (ms)
n = 14, time: 53.841999 (ms)
n = 15, time: 188.386002 (ms)
n = 16, time: 188.408005 (ms)
n = 17, time: 188.423004 (ms)
n = 18, time: 188.440002 (ms)
n = 19, time: 188.455002 (ms)
n = 20, time: 322.501007 (ms)
Processing time: 322.506989 (ms)

So, FLOPS = (1024)^3 * 20 / 0.3225 = 66.5 Gigaflops, which is more reasonable.
Note that the above cumulative time does not increase smoothly, but in steps.

In both cases, the results were verified as correct against a CPU version of the code.

Can FLOPS be measured more repeatably in CUDA?

Are you running the code a few times to smooth out the startup time and such?

Most of the examples do something like that. Haven’t explored it myself, but it seems like a rule of thumb: repeat tests for timing reliability.

The startup time, say, the time for n = 1, varies greatly from run to run and from day to day.

I think I will use the clock on my CPU, which gives more repeatable results, instead of the one in CUDA.

Thanks.

The first call to CUDA triggers driver initialization; you should never time it. Skip the first call and time a few thousand calls after that to get a good average.

Thanks for the suggestion. I am now getting more or less repeatable results close to 40 Gigaflops by running for n = 500 several times.

However, I am still puzzled by the fact that the cumulative time increases in steps. The following are sample results for n = 500, showing both the time on the CUDA clock (in ms) and the time on the CPU clock (in s):

n = 486, time: 12734.717773 (ms) or 12.860000 (s)
n = 487, time: 12734.778320 (ms) or 12.860000 (s)
n = 488, time: 12868.706055 (ms) or 12.990000 (s)
n = 489, time: 12868.728516 (ms) or 12.990000 (s)
n = 490, time: 12868.773438 (ms) or 12.990000 (s)
n = 491, time: 12868.848633 (ms) or 12.990000 (s)
n = 492, time: 12868.909180 (ms) or 12.990000 (s)
n = 493, time: 13002.848633 (ms) or 13.130000 (s)
n = 494, time: 13002.871094 (ms) or 13.130000 (s)
n = 495, time: 13002.915039 (ms) or 13.130000 (s)
n = 496, time: 13002.990234 (ms) or 13.130000 (s)
n = 497, time: 13003.048828 (ms) or 13.130000 (s)
n = 498, time: 13136.938477 (ms) or 13.260000 (s)
n = 499, time: 13136.959961 (ms) or 13.260000 (s)
n = 500, time: 13137.004883 (ms) or 13.260000 (s)

Both times increase in steps together, which means that both clocks are correct, but the FLOPS computed from time(n+1) - time(n) varies from GFlops to TFlops.

Why are you puzzled? There are 16 multiprocessors for a total of 128 processors (or maybe fewer for your card). I see lots of potential for computation time increasing in steps. Often CUDA code is actually bound by memory bandwidth, so you can do more calculations in the same amount of running time.

Also, you cannot calculate FLOPS by comparing n+1 with n.

Do you call cudaThreadSynchronize() before recording times? If not, the steps you see are from asynchronous kernel launches. The queue depth is 16 kernels, so you can run up to 16 and measure a "0" time if you don't synchronize and wait for the kernels to complete.
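Sketched against the loop from the first post (Muld, Ad, Bd, Cd, and the CUTIL timer come from the original code), the fix is one extra call so the host waits for all queued kernels before the timer is read:

```
for (int m = 1; m <= n; m++) {
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);
    cudaMemcpy(Bd, Cd, size, cudaMemcpyDeviceToDevice);
    cudaThreadSynchronize();  // block until the kernel has actually
                              // finished before reading the timer
    printf("n = %d, time: %f (ms)\n", m, cutGetTimerValue(timer));
}
```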

That’s the problem! After calling cudaThreadSynchronize(), my cumulative time now increases smoothly:

n = 1, time: 26.754999 (ms) or 0.190000 (s)
n = 2, time: 53.610001 (ms) or 0.220000 (s)
n = 3, time: 80.989998 (ms) or 0.240000 (s)
n = 4, time: 107.703003 (ms) or 0.270000 (s)
n = 5, time: 135.078003 (ms) or 0.300000 (s)
n = 6, time: 161.727997 (ms) or 0.320000 (s)
n = 7, time: 189.078003 (ms) or 0.350000 (s)
n = 8, time: 215.727997 (ms) or 0.380000 (s)
n = 9, time: 243.095001 (ms) or 0.410000 (s)
n = 10, time: 269.779999 (ms) or 0.430000 (s)

As to DenisR’s remark, I can now (after calling cudaThreadSynchronize) actually compute FLOPS from n to n+1 repeatably. For example, from t1 to t2:

FLOPS = (1024)^3 / (0.05361 - 0.02675) = 40 Gigaflops,

which is the same as what I obtained for n = 500 yesterday.

Just be careful that for some kernels the time will not increase smoothly with input size, but in steps, so the incremental count is dangerous there. I would do it differently: calculate GFLOPS for n+1 and for n, and then subtract those values :) Then you can see how big n must be to reach the maximum GFLOPS.

I’ll keep that in mind.

For the sake of completeness, let me also quote the FLOP results for my CPU, which is a single core of a 2.4 GHz quad-core processor. For a matrix size of 1024x1024, it is only 0.085 GFlops, while for a matrix size of 512x512, it is 0.43 GFlops. (In both cases, CUDA is close to 40 GFlops.)

I can tell you that an 8800GTX vs. one core of a quad-core (3.66 GHz) gives a factor-of-50 speedup when calling CUDA from MATLAB vs. simple C code called from MATLAB (both including the MATLAB-to-C calling overhead), when summing 5500+ sinusoids on an irregular grid.

Impressive.

Yeah, up to now the performance is generally mighty impressive.

Of course, when you use all four cores and SSE the speedup will suddenly drop to somewhere around 4. Still not bad.

Yep, but I have no idea how to use SSE, too difficult for me :blink:

It really isn’t that bad - this tutorial is decent: http://www.codeproject.com/KB/recipes/sseintro.aspx

I don’t use Visual C++ :D But you are right, that doesn’t look too bad.

Here are my results for different numbers of cores:

Matrix size 1024x1024:

```
No. cores    GFlops
1            0.062 (varies a bit from run to run)
2            0.093
3            0.102
4            0.103
```

Matrix size 512x512:

```
No. cores    GFlops
1            0.42
2            0.84
3            1.06
4            1.37
```

I used OpenMP to parallelize this matrix multiplication code. My experience is that a 2X speed improvement in going from 1 to 4 cores is typical for large problems.

Would SSE do much better?

I’m fairly sure the same intrinsics work with gcc. Maybe there are some minor differences…

SSE vectorization is orthogonal to threading, so it has the potential of additional speedup.

Paulius