How do you measure the GFLOPS for your kernel?

I’ve been reading up on the capabilities of GPUs, and I’ve seen claims that they can reach up to 500 GFLOPS.
I’ve been programming in CUDA for a few weeks now, but I haven’t found any information on how to calculate the GFLOPS of my own kernels.
To measure some performance parameters, I’ve been using the Visual Profiler, which reports, for every kernel, the number of instructions and the instruction throughput.
In the help file they define the instruction throughput as follows:

“Instruction throughput ratio. This is the ratio of achieved instruction rate to peak single issue instruction rate. The achieved instruction rate is calculated using the “instructions” profiler counter. The peak instruction rate is calculated based on the GPU clock speed. In the case of instruction dual-issue coming into play, this ratio shoots up to greater than 1.”

This sounds like Chinese to me…
Can someone explain this to me, or tell me how to calculate GFLOPS, either from this information or in some other way?
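For reference, here is my current understanding of the arithmetic, as a minimal sketch. The kernel name, element count, and timing are all hypothetical placeholders — my actual question is how to get the real FLOP count and whether dividing by the measured kernel time is the right approach:

```python
def achieved_gflops(flop_count, elapsed_seconds):
    """GFLOPS = floating-point operations / elapsed time / 1e9.

    flop_count must be counted by hand from the kernel's arithmetic
    (hypothetical here); elapsed_seconds would come from timing the
    kernel launch, e.g. with CUDA events.
    """
    return flop_count / elapsed_seconds / 1e9

# Hypothetical example: a kernel doing one multiply and one add
# per element (2 FLOPs) over N = 16M elements, measured at 1.5 ms.
N = 16 * 1024 * 1024
flops = 2 * N
print(achieved_gflops(flops, 1.5e-3))  # ~22.4 GFLOPS
```

Is this the right idea, or does the profiler’s instruction count already let me compute this without counting FLOPs by hand?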