GFLOP question

Hi,

I might be missing something here (or I'm too tired :) ). I have a GTX280 and use the code below.

If I count the floating-point operations correctly, I get: 26 * (26 * 3) * 4 * 256 * 2700 * 3

which stands for: dimGrid.x * dimGrid.y * TimeLoops * threads per block * TraceIndex iterations * operations in the inner loop

This takes ~87 ms → that means I get ~180 GFLOPS???
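
To make the arithmetic above easy to re-check, here is a minimal host-side sketch. The helper names `totalOps` and `gigaOpsPerSecond` are made up for illustration, and the figure of 3 operations per inner iteration is simply the count used in the estimate above, not a measured value.

```cpp
#include <cassert>
#include <cstdint>

// Total operations issued by the whole grid, using the accounting from the post:
// blocks * threads per block * outer iterations * inner iterations * ops per inner iteration.
std::uint64_t totalOps(std::uint64_t gridX, std::uint64_t gridY,
                       std::uint64_t threadsPerBlock,
                       std::uint64_t timeLoops, std::uint64_t traceLoops,
                       std::uint64_t opsPerIteration)
{
    return gridX * gridY * threadsPerBlock * timeLoops * traceLoops * opsPerIteration;
}

// Convert an operation count and an elapsed time in milliseconds
// into giga-operations per second.
double gigaOpsPerSecond(std::uint64_t ops, double elapsedMs)
{
    assert(elapsedMs > 0.0);  // a zero or negative time would be a measurement error
    return static_cast<double>(ops) / (elapsedMs * 1e-3) / 1e9;
}
```

With the launch configuration from this post, `totalOps(26, 26 * 3, 256, 4, 2700, 3)` comes to 16,821,043,200 operations. Dividing by exactly 87 ms gives a little over 190 giga-operations per second, which is in the same range as the rough ~180 figure given that the timing is approximate.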

Any suggestions are more than welcome…

dim3 mydim( 26, 26 * 3 );

kernel<<< mydim, 256 >>>( pOut);

[codebox]

__global__ void kernel( float *pOut )
{
   float f1 = 0.0f;

   for ( int iCurrentTimeLoop = 0; iCurrentTimeLoop < 4; iCurrentTimeLoop++ )
   {
      for ( int iTraceIndex = 0; iTraceIndex < 2700; iTraceIndex++ )
      {
         f1 += iTraceIndex;
      }

      // Global-memory read-modify-write on every outer iteration
      pOut[ threadIdx.x ] += f1;
   }
}

[/codebox]

Your kernel is more likely memory bandwidth bound.
All your blocks are updating the same elements in the output array, so you cannot reliably calculate how many GB/s it is doing. If you change that (for example, by giving each thread its own output element), you can calculate how many GB/s you are doing and compare that to the theoretical memory bandwidth.
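
The GB/s calculation could be sketched on the host side roughly as follows. This assumes that each thread owns one `float` in the output array and that every update is one 4-byte load plus one 4-byte store. The helper names `bytesMoved` and `effectiveGBps` are hypothetical, not part of any CUDA API.

```cpp
#include <cassert>
#include <cstdint>

// Bytes transferred if every thread performs `updatesPerThread` read-modify-write
// operations on its own float (assumption: one load and one store per update).
std::uint64_t bytesMoved(std::uint64_t blocks, std::uint64_t threadsPerBlock,
                         std::uint64_t updatesPerThread)
{
    const std::uint64_t bytesPerUpdate = 2 * sizeof(float);  // load + store
    return blocks * threadsPerBlock * updatesPerThread * bytesPerUpdate;
}

// Effective bandwidth in GB/s (10^9 bytes per second) for a measured kernel time.
double effectiveGBps(std::uint64_t bytes, double elapsedMs)
{
    assert(elapsedMs > 0.0);  // the time must come from an actual measurement
    return static_cast<double>(bytes) / (elapsedMs * 1e-3) / 1e9;
}
```

For the launch in the question, the first argument would be 26 * 26 * 3 blocks with 256 threads each and 4 updates per thread. The elapsed time has to be measured again after the indexing change, and the result is then compared with the theoretical bandwidth from the card's specification.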

The following code should be quite a bit faster:

__global__ void kernel( float *pOut )
{
   float f1 = 0.0f;
   float tmp = 0.0f;   // accumulates in a register instead of global memory

   for ( int iCurrentTimeLoop = 0; iCurrentTimeLoop < 4; iCurrentTimeLoop++ )
   {
      for ( int iTraceIndex = 0; iTraceIndex < 2700; iTraceIndex++ )
      {
         f1 += iTraceIndex;
      }

      tmp += f1;
   }

   // Single global-memory update per thread, after all loops have finished
   pOut[ threadIdx.x ] += tmp;
}

Just to show you that you were bandwidth-bound. ;-)