I noticed that the GFLOPS value of my multi-GPU CUFFT run comes out almost the same as with 1 GPU (see the attached picture). The timings I took were of the CUFFT kernel. I expected the GFLOPS to increase with the number of GPUs until it flattens out at some point.
GFLOPS = 1E-09*5*n*m*log2(n*m)/(kernel-execution-time-in-secs)
When I measure the performance of a GPU program, I generally include only the kernel execution time; is that the wrong approach? Do I need to include the host<->device memcpy time as well?
Thanks.