Question about measuring GFLOPS


I noticed that the GFLOPS value for my multi-GPU CUFFT run comes out almost the same as with 1 GPU (see the attached picture). The timings I took were of the CUFFT kernel. I expected the GFLOPS to increase with the number of GPUs until it flattens out at some point.

GFLOPS = 1E-09*5*n*m*log2(n*m)/(kernel-execution-time-in-secs)
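
To make the arithmetic concrete, here is a minimal sketch of that formula as a small helper. It assumes n and m are the two dimensions of a 2D complex transform and that the time is given in seconds; the function name `fft2d_gflops` is just illustrative and not part of CUFFT.

```python
import math

def fft2d_gflops(n, m, seconds):
    """Estimated GFLOPS for an n x m FFT, using the 5*N*log2(N) flop-count convention."""
    points = n * m                             # total number of points, N = n*m
    flops = 5.0 * points * math.log2(points)   # nominal flop count, not a measured one
    return 1e-9 * flops / seconds

# Example: a 1024 x 1024 transform that took 1 ms (hypothetical timing).
print(fft2d_gflops(1024, 1024, 1e-3))
```

Note that the result scales as 1/time, so whatever is (or is not) included in the measured time changes the reported number directly.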

When I measure the performance of a GPU program, I generally include only the kernel execution time. Is this the wrong thing to do? Do I need to include the host<->device memcpy as well?


[Attached image: Screen shot 2012-02-23 at 11.02.04 PM.png]

I had previously summed the kernel times, and hence the results were coming out incorrect. Here is my current plot. I noticed that if I include only the kernel time I get huge GFLOPS values, so I included the plan creation time as well. Including memcpy lowers the numbers further. I am not trying to fluff up the numbers, just trying to come up with reasonable ones.
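
For anyone comparing these variants, here is a rough sketch of how the reported number changes with what is counted. It assumes the summed times were per-GPU kernel times for GPUs running concurrently (so the elapsed time is that of the slowest GPU, not the sum), and that plan creation and memcpy run serially around the kernels. All timings below are hypothetical placeholders, not my measurements, and the helper names are illustrative.

```python
import math

def gflops(n, m, seconds):
    """GFLOPS from the 5*N*log2(N) convention, with N = n*m."""
    points = n * m
    return 1e-9 * 5.0 * points * math.log2(points) / seconds

def concurrent_elapsed(per_gpu_seconds):
    # Assumption: all GPUs start together and run concurrently,
    # so the run is done when the slowest GPU finishes.
    return max(per_gpu_seconds)

# Hypothetical timings in seconds (placeholders only).
kernel = [0.0021, 0.0020, 0.0022, 0.0021]  # per-GPU kernel time
plan = 0.0150                              # plan creation
memcpy = 0.0080                            # host<->device copies

n = m = 4096
t_kernel = concurrent_elapsed(kernel)

variants = [
    ("kernel only (summed)", sum(kernel)),
    ("kernel only (slowest GPU)", t_kernel),
    ("kernel + plan", t_kernel + plan),
    ("kernel + plan + memcpy", t_kernel + plan + memcpy),
]
for label, t in variants:
    print(f"{label:<28s} {gflops(n, m, t):10.1f} GFLOPS")
```

Reporting the variants side by side, with a note on what each one includes, avoids having to pick a single "reasonable" number.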