CUDA's FLOPS too small? 8800GT only reaches 16 GFLOPS?

Hi.

I’ve rewritten the scan example program, as shown below, to estimate
the speed of CUDA on an 8800GT.
The block size is 256 and the grid is 256×256. Each thread runs a loop
repeating num = (num+2.0)/(num+1.8). To eliminate the effects of memory access,
bank conflicts, thread-creation overhead, etc., I took the difference between
10 loop iterations and 20 loop iterations.
The time is 123 ms for imax=10 and 155 ms for imax=20, i.e. 32 ms for the extra 10 iterations.
This means the speed is 256 × 256 × 256 × 10 × (1000/32) × 3 ≈ 16 GFLOPS
(the factor 3 being the three floating-point operations in (num+2.0)/(num+1.8)).
This is much smaller than the officially announced 350 GFLOPS of the 8800GT.
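For reference, here is that arithmetic spelled out as a small host-side sketch (the 256-thread blocks, 256×256 grid, 3 flops per iteration and 32 ms difference are just the numbers quoted above):

#include <stdio.h>

int main(void)
{
    double threads = 256.0 * 256.0 * 256.0;  // total threads launched
    double flops   = threads * 10.0 * 3.0;   // 10 extra iterations, 3 ops each
    double seconds = 32e-3;                  // measured time difference
    printf("%.1f GFLOPS\n", flops / seconds / 1e9);  // prints ~15.7
    return 0;
}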

Could anyone tell me what is wrong?

work_efficient.cu:

__global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
{
int thid = threadIdx.x;

float num = 2.3f; // 'f' suffix keeps the constant single precision

int counter = 0;
// To eliminate the effect of accessing memory etc.,
// I take the difference between the two cases below
int imax = 10;
//int imax = 20;
for (counter = 0; counter < imax; counter++) {
	num = (num+2.0f)/(num+1.8f);
}
__syncthreads();

g_odata[2*thid] = num; // store the result so the compiler cannot optimize the loop away

}

scan.cu (abbreviated):
dim3 grid(512, 512, 1);
dim3 threads(512, 1, 1);

cutStartTimer(timer);
scan_workefficient<<< grid, threads, shared_mem_size >>>
    (d_odata[1], d_idata, num_elements);
cudaThreadSynchronize();
cutStopTimer(timer);
printf("Average time: %f ms\n\n", cutGetTimerValue(timer));

Thank you.

According to the manual (Chapter 5 "Performance Guidelines", 5.1.1.1 "Arithmetic Instructions"), there is no floating-point division instruction, only the reciprocal 1/n, and that is a very slow instruction (16 clocks compared to 4 clocks for the others). So the division is actually done in two steps: first 1/n (which costs the same time as 4 operations) and then a multiplication (1 more operation). Your formula should therefore use a factor of 7 rather than 3 (two additions, plus 4 + 1 for the division).
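To make the counting concrete, here is a hedged sketch of how one loop iteration might be lowered on G80 (the exact lowering is up to the compiler; check the PTX to confirm):

// Hypothetical lowering of num = (num+2.0f)/(num+1.8f):
float a = num + 2.0f;  // 1 op (ADD)
float b = num + 1.8f;  // 1 op (ADD)
float r = 1.0f / b;    // reciprocal: ~16 clocks, i.e. ~4 op-equivalents
num     = a * r;       // 1 op (MUL)
// Total: roughly 7 op-equivalents per iteration, not 3.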

Also, there is the loop itself, which takes up cycles. You should have a look at the generated PTX to see how much unrolling the compiler performs, although the scheduler is supposed to hide those cycle costs through multithreading.
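For instance, you can dump the PTX with nvcc and, if you like, force the unrolling yourself (#pragma unroll is honoured by recent nvcc versions; this is a sketch, not tested on your setup):

// Dump the PTX to inspect the generated code:
//   nvcc -ptx work_efficient.cu -o work_efficient.ptx
// Optionally force unrolling of the measurement loop:
#pragma unroll 10
for (counter = 0; counter < imax; counter++) {
	num = (num+2.0f)/(num+1.8f);
}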

Also, maybe use a bigger grid to minimise other influences on the timing (more blocks than the current 256×256 grid?).

Finally, there is some startup cost the first time a kernel is launched. You should put a loop in the main file and take the median of the run times.
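A minimal sketch using the cutil timer already shown above (the warm-up launch and the RUNS count are my additions):

// Warm up once so the first-launch overhead is excluded, then time RUNS launches.
const int RUNS = 9;
float t[RUNS];
scan_workefficient<<< grid, threads, shared_mem_size >>>(d_odata[1], d_idata, num_elements);
cudaThreadSynchronize();
for (int r = 0; r < RUNS; r++) {
    cutResetTimer(timer);
    cutStartTimer(timer);
    scan_workefficient<<< grid, threads, shared_mem_size >>>(d_odata[1], d_idata, num_elements);
    cudaThreadSynchronize();
    cutStopTimer(timer);
    t[r] = cutGetTimerValue(timer);
}
// Sort t (a small insertion sort will do) and report t[RUNS/2] as the median.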

__syncthreads();

I don’t see the point in synchronizing the threads of a warp when none of them actually depends on the output of the others. Maybe you should try without it (to avoid some overhead induced by thread swapping, perhaps?).

The scan in the SDK is not the most optimized version. Refer to the slides from Supercomputing 2007 for a discussion of all optimizations for scan.

Also, keep in mind that scan is very memory-intensive, and the efficiency of most parallel algorithms (not just GPU ones) is about 50%. So it’s not a good candidate for evaluating a device’s GFLOPS capability (unless scan is precisely what you’re interested in).

Paulius

P.S. 50% efficiency means that the parallel algorithm does twice the work of the sequential one.