How to compute the GFLOPS of a program?

In some papers, authors often use GFLOPS as the benchmark to evaluate the application efficiency. However, I want to know how to compute the value of GFLOPS.
Take the matrix multiplication as an example:
(the italic sentences below is extracted from a paper)
1) The 16x16 tiled version of matrix multiplication achieves 46.49 GFLOPS.
2) By examining the PTX for this code, we find that there is approximately one fused multiply-add out of eight operations in the inner loop, for a estimated potential throughput of 43.2 GFLOPS.
(But I find that two fused multiply-add of eight operations in the inner loop.)
So, I have several questions to ask:

  1. How the GFLOPS values above do is computed?
  2. For any program, is there a general formula to compute the GFLOPS?

Thanks a lot in advance!
Best regards.

Yes. You walk through your kernel code and count the number of FLOPs in each thread (the general rule is to count +, -, , /, sin, ln, sqrt, rsqrt, etc… each as 1 FLOP). Then you take num_threadsflops_per_thread / seconds_of_calculation to get FLOP/s.

Comparing FLOP/s between several algorithms is only meaningful if those two algorithms employ the exact same operations, just in a different order or configuration. As an example of where counting GLOP/s doesn’t make sense I can say one kernel A in my code gets 250 GFLOP/s! “That’s great!”, you would say. But I like kernel B much better, even though it only sustains 60 GFLOP/s…

See: kernel A is O(N^2) and about 10,000 times slower than an kernel B which does the same thing in O(N) time. Of course, you can take the astronomers’ route (this is only standard in astronomy AFAIK) and write the GFLOP/s that your O(N) algorithm would do if it were an O(N^2) one… In that sense: kernel B does 10,000 * 60 = 600,000 GLOP/s!

In short, counting GFLOP/s is only an estimate, it is confusing, and there are a number of standards that people follow so you must explain exactly what you mean. I prefer counting run time or operations per second above FLOP/s by far because you can compare these values between widely different algorithms that calculate the same final result.

With that being said, counting the real GFLOP/s a piece of code sustains is useful, if only to see how closely you are pushing to the hardware limits. Similarly, you can count the number of GiB/s of global memory bandwidth to see how effectively you are pushing that limit of your device (the matrixMul sample should do this… it would show where the bottleneck is).

I wish there were performance counters like on the Intel CPUs.

Counting gigaflops is only done for matrix multiplication. Maybe you could do it for another fundamental algorithm that can’t be simplified. But for any real application, it’s a pointless exercise. In optimizing, you’re trying to get rid of ops, total, even if you get fewer ops per second (maybe because you couldn’t get rid of as many memory accesses).

Also, for matrix-multiply it’s possible to really get close to 100% on most hardware, and you know what you’re comparing your result against. But with a real-world app, what is your benchmark for “efficient?” The most ideal version of your app might get 3% or 30%. But you have no idea, and whatever result you come up with, eg 5%, doesn’t tell you anything whether your code was efficient or not. (Again, discounting the fact that out of those 5%, 4.9% may be unnecessary work.) There’s no point.

The only valid unit of measure for performance is results per time. Not operations per time, not the-operations-itd-take-some-other-guy per time. Just answers per time. It just so happens that for matrix-mul, the same results always take the same steps and so people settled on this flashy but deceptive unit of measure.

I think it is useful for checking if you are compute bound (for me in theory, I really would like to meet someone that is compute bound ;), your first check should always be to see if you are memory-bandwidth bound)

Also in optimizing I find myself not trying to get rid of operations, I will happily add some ops to my kernel if I can get rid of memory reads. I have cycles left to burn in practically all of my kernels that take a significant amount of time.

Optimizing consists as far as I can tell (in order of importance)

  1. coalescing memory access

(1,5 if you are using so many registers that you cannot hide the memory-latency anymore, you have to try to minimize the amount of registers)

  1. minimizing the amount of memory accesses

  2. getting rid of divergent branches

  3. getting rid of bank conflicts in shared memory

  4. getting rid of operations

I am not 100% sure of the order (but I am sure I am not far off B) )

That’s a good list. It should be like in the CUDA docs.

That is a pretty good list. Coalescing (either by reorganizing your data or using shared memory as an intermediate buffer) is the number one way to increase performance, and compared to uncoalesced code, you’re often looking at a 3-5x increase in perf. Minimizing accesses is also good, considering memory bandwidth is usually the bottleneck.

The other things I would consider (at number 6 and below) are using registers instead of shared memory when possible (mainly on GT200), minimizing the number of serializations due to constant memory and minimizing atomics.

It is what I remember from another list that is to be found somewhere on the forum. I am afraid I am not the one that came up with it. (although I am now as far as finally understanding why :D )

Hmm, using registers instead of shared memory is not something I would think of. I would guess that is more for optimizing existing algorithms (where one sometimes offloads registers to shared memory) for GT200.

Is there a place on the CUDA zone where someone (you e.g. ;)) could keep such a list? Or maybe put it in a sticky post at the top of the programming forum?

This is something I’ve heard someone else mention. In what way are registers slower than shared mem, and how much? Also, what exactly are serializations, and how bad are they? (I guess they’re like an implicit syncthreads. But do they have an intrinsic performance hit, or are they important only when warps have asynchronous workloads?)

Generally, thread serializations are when you read a number of different addresses from constant memory within a single half-warp. Check 5.1.2.3. It’s not a huge penalty, but it’s something to be aware of when you’re using constant memory. I believe using atomics can cause serializations (as shown in the profiler) as well, so if you’re not using constant memory or if you’re using constant memory in such a way that shouldn’t serialize, there can be other causes.

As far as registers being slower than shared memory… this is a tricky subject. This thread (unsurprisingly, a vvolkov thread) is a good introduction to the subject matter. But there’s a reason this is number 6–if you’ve squeezed every last drop of memory bandwidth out of the card and you’re effectively hiding latency, then this is something to consider if it doesn’t break 1 through 5. The moral of the story is really just to not use shared memory as registers if you can avoid it. If you can’t, it’s really not a big deal, as you’ll probably be bottlenecked elsewhere, but just keep it in mind.

Denis, there are lots of things I’d like to do to the forums, but I only have so much time in the day :(

I understand, but as my boss always said, there are 24 hours in a day, and then you have the night to sleep ;) I have a better proposition for you:

If ‘the forum’ came up with a text that includes such a list. And the text & list would seem accurate to the NVIDIA people. Could you make a sticky post with the contents? (So we do the work :D) We normal users are not able to make sticky posts, that was more the reason for my question. Such a post is, I think, a great thing to point new users at when they have questions. It is much easier to find than (posts in) a normal thread.

I often find myself rewriting things because finding back the posts where it was written before takes more time :(

Keeping the forums a bit better up to date is on my radar–hopefully something will happen soon.

I’ve also heard ‘serialization’ in connection with smem bank conflicts. Hold on, so is ‘serialization’ a more general term for bank conflict? (its meaning being that the threads inside a half-warp aren’t executed in parallel) And obviously when you issue an atomic the threads act like there is one, big ‘bank conflict.’

Ok, that makes sense now. I really wish the profiler doc’s would explain all this (instead of c&p the developer’s internal GUI spec… what’s up with that?)

To that you should probably add “(includes getting rid of local memory usage)” - in the worst case by using a “volatile shared” variable/array manually.

I didn’t know the way how we calculate the GFlop rating for the programs we write. This helped me a lot. But it is showing very huge rating for matrix sizes greater than 3000. The following code is the one written by me with help of book “Programming massively parallel processors”(NVIDIA), pg no: 83. I would be very grateful if I can know the mistake I have done in calculating the Flop rating.

#include<stdio.h>

#include<cuda.h>

#define TILE_WIDTH 16

global void mmmul(float *A, float *B, float *C, int len)

{

    __shared__ float As[TILE_WIDTH][TILE_WIDTH];

    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

int bx = blockIdx.x;

    int by = blockIdx.y;

    int tx = threadIdx.x;

    int ty = threadIdx.y;

    int row = by * TILE_WIDTH + ty;

    int col = bx * TILE_WIDTH + tx;

    float val = 0;

    for(int m=0; m<len/TILE_WIDTH; m++)

    {

            As[ty][tx] = A[ row * len + (m * TILE_WIDTH + tx)];

            Bs[ty][tx] = B[ (m * TILE_WIDTH + ty) * len + col];

            __syncthreads();

            for(int k=0; k<TILE_WIDTH; k++)

                    val += As[k][ty] * Bs[tx][k];

            __syncthreads();

    }

    C[ row * len + col] = val;

}

int main(int argc, char *argv)

{

int mlen = atoi(argv[1]);

int size = mlen * mlen * sizeof(float);

float *hostA, *hostB, *hostC;

float *devA, *devB, *devC;

cudaEvent_t start, stop;

hostA = (float *)malloc(size);

hostB = (float *)malloc(size);

hostC = (float *)malloc(size);

for(int i=0; i<mlen; i++)

{

    for(int j=0; j<mlen; j++)

    {

            hostA[i * mlen +j] = 2.00f;

            hostB[i * mlen +j] = 2.00f;

    }

}

cudaMalloc((void **)&devA, size);

cudaMalloc((void **)&devB, size);

cudaMalloc((void **)&devC, size);

cudaMemcpy((void*)devA,(void*)hostA,size,cudaMemcpyHostToDevice);

cudaMemcpy((void*)devB,(void*)hostB,size,cudaMemcpyHostToDevice);

int m = mlen * mlen/(TILE_WIDTH * TILE_WIDTH);

if(((mlen * mlen) % (TILE_WIDTH * TILE_WIDTH)) !=0)

m = m + 1;

dim3 threadsperblock(TILE_WIDTH, TILE_WIDTH);

dim3 blockspergrid(m,1);

cudaEventCreate (&start);

cudaEventCreate (&stop);

cudaEventRecord (start, 0);

mmmul<<<blockspergrid, threadsperblock>>>(devA, devB, devC, mlen);

cudaEventRecord (stop, 0);

cudaEventSynchronize(stop);

float elapsedTime;

cudaEventElapsedTime ( &elapsedTime, start, stop);

float Tsec = elapsedTime *1.0e-3;

float gflops = ((TILE_WIDTH * TILE_WIDTH * m *(6 + (3 * mlen) + ((float)(mlen/TILE_WIDTH) * 9)))/ Tsec) * 1.0e-9;

printf("glop rating : %f ",gflops);

cudaMemcpy((void*)hostC,(void*)devA,size,cudaMemcpyDeviceToHost);

/for(int i=0;i<mlenmlen;i++)

    printf("%f  ",hostC[i]);*/

cudaEventDestroy(start);

cudaEventDestroy(stop);

cudaFree(devA);

cudaFree(devB);

free(hostA);

free(hostB);

free(hostC);

}

Thank you.

This part of the code is taken from matrixmatrix multiplication code provided by the nvidia SDK-4.0

double dSeconds = cutGetTimerValue(timer_matrixMul)/((double)nIter * 1000.0);

            double dNumOps = 2.0 * (double)uiWA * (double)uiHA * (double)uiWB;

            double gflops = 1.0e-9 * dNumOps/dSeconds;

Here, it looks like the number of operations is not calculated according to tha formula. Can you help me with this? I have attached the entire code.

matrixMul.zip (7.21 KB)