Low performance. What’s wrong?

I’m a newbie to CUDA, but I like it :D

(I’m running XP on a workstation with a Quadro FX 4600, driver version )
The nBody example runs at about 94 GFlops (as supplied, not rebuilt).

I need to run ~15 million threads about 16,000 times for an application. 15 million is too many for the GPU, so I’ve split them into groups I call ‘Cohorts’.
I adapted myFirstKernel.cu to the application, but performance was way too low, so I cut it right down to almost nothing (see below). Performance is still very low.
I’m building the code in Visual Studio in Release mode, and it takes about 9 seconds to run the iterations with the settings #define’d below, regardless of whether I run the .exe from a .bat or from within Visual Studio (NB the time is for the inner loop that calls the kernel only).
Assuming that my kernel is about 5 flop per thread, this cut-down code is running at about 0.15 GFlops! :(
I have changed the BlockSize and the number of threads I’m running, but the speed stays about the same.

Is there something obvious I’m doing wrong?

== Cut down code ==
#define Iterations 40002
#define nThreads 8096
#define BlockSize 256
#define Cohorts 1

__global__ void model( int offset, float* input, float* result)
{
    // each thread scales one element
    unsigned int loc = offset + blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[loc];
    result[loc] = x * 0.1231f;
}

int main( int argc, char** argv)
{
    // usual code for setting up arrays on host and device and copying data to GPU //

    int numBlocks = nThreads/BlockSize;
    int numThreadsPerBlock = BlockSize;
    int blocksPerCohort = numBlocks/Cohorts;

    dim3 dimGrid( blocksPerCohort, numThreadsPerBlock );
    dim3 dimBlock( numThreadsPerBlock );

    int cohort = 0;

    // recording start time here
    for ( int its = 0; its < Iterations; its++)
    {
        // this simple model has 3 params: offset, input, result
        model<<< dimGrid , dimBlock >>>( cohort*blocksPerCohort, d_input, d_result);
        checkCUDAError("kernel execution");
    }
    // recording finish time here

    // usual code for copying results back, testing results, freeing device and host memory, exit gracefully //
}
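
A side note on the cut-down code above, offered as a sketch rather than a definitive fix: the kernel adds offset directly to the element index, but the launch passes cohort*blocksPerCohort, which counts blocks rather than elements. With cohort fixed at 0 this makes no difference. Assuming the intent is for each cohort to cover its own contiguous slice of the arrays, the offset would need to be scaled by the block size. The helper name cohortElementOffset is hypothetical and not part of the original code.

```cuda
// Hypothetical helper, not part of the original post.
// Assumption: cohort c handles the contiguous element range
// [c * blocksPerCohort * blockSize, (c + 1) * blocksPerCohort * blockSize).
static int cohortElementOffset(int cohort, int blocksPerCohort, int blockSize)
{
    // Convert a block count into an element count:
    // each block covers blockSize consecutive elements.
    return cohort * blocksPerCohort * blockSize;
}
```

Under that assumption, the launch inside a loop over cohorts could pass cohortElementOffset(cohort, blocksPerCohort, BlockSize) as the first kernel argument instead of cohort*blocksPerCohort.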

Can you post the command line you use to compile the code? Are you sure you’re not running in emulation mode?



I can’t figure out the sense of

dim3 dimGrid( blocksPerCohort, numThreadsPerBlock );

dim3 dimBlock( numThreadsPerBlock );


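The total number of threads is #blocks * #threadsPerBlock, but your code is launching

(#blocksPerCohort * #numThreadsPerBlock) * #numThreadsPerBlock threads, because the second argument to dimGrid is the grid’s size in y (a number of blocks), not a number of threads.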
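
To put numbers on that, here is a small host-side sketch that counts the threads in a launch configuration. The helper name totalThreads is made up for illustration. It only multiplies the grid and block dimensions together, which is how the size of a launch is determined.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: total threads in a launch
// = (number of blocks in the grid) * (number of threads per block).
static unsigned long long totalThreads(dim3 grid, dim3 block)
{
    unsigned long long blocks  = (unsigned long long)grid.x * grid.y * grid.z;
    unsigned long long threads = (unsigned long long)block.x * block.y * block.z;
    return blocks * threads;
}
```

With the posted settings, blocksPerCohort is 8096/256 = 31 (integer division). totalThreads(dim3(31, 256), dim3(256)) gives 2,031,616 threads, while the one-dimensional grid totalThreads(dim3(31), dim3(256)) gives 7,936. Because the kernel only uses blockIdx.x, every one of the 256 rows of blocks in y recomputes the same elements, so the launch as posted does 256 times the intended work.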

Thank you both.

I misunderstood the second parameter to dimGrid and thought it had to be the number of threads per block.

After changing my code to the following, I can now get 10,000 iterations of 4 million threads in under 7 seconds :)

dim3 dimGrid( blocksPerCohort );

Thanks again

PS Compiling with Visual Studio 2008 in Release mode
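
For anyone landing on this thread later, here is a minimal sketch of the corrected one-dimensional launch. It is not the original poster's full program. The names modelChecked, blocksFor and runModel are hypothetical, and the bounds check, the rounded-up block count and the event-based timing are additions that the thread itself does not discuss. Kernel launches return before the GPU has finished, so the sketch waits on the stop event before reading the elapsed time.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same arithmetic as the posted kernel, plus a bounds check (an addition)
// so that a rounded-up grid cannot touch memory past the end of the arrays.
__global__ void modelChecked(int offset, int n, const float* input, float* result)
{
    int loc = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (loc < n) {
        result[loc] = input[loc] * 0.1231f;
    }
}

// Number of blocks needed to cover n elements, rounding up.
static int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical driver: launches the kernel `iterations` times over all of
// h_in, copies the last result into h_out, and returns the elapsed GPU time
// in milliseconds, or -1.0f if a CUDA call reported an error.
static float runModel(const std::vector<float>& h_in, std::vector<float>& h_out,
                      int blockSize, int iterations)
{
    int n = (int)h_in.size();
    size_t bytes = n * sizeof(float);
    float* d_in = 0;
    float* d_out = 0;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 dimGrid(blocksFor(n, blockSize));  // grid size is a number of blocks
    dim3 dimBlock(blockSize);               // block size is a number of threads

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int its = 0; its < iterations; its++) {
        modelChecked<<<dimGrid, dimBlock>>>(0, n, d_in, d_out);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for all queued launches to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaError_t err = cudaGetLastError();

    h_out.resize(n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? ms : -1.0f;
}
```

With nThreads = 8096 and BlockSize = 256, blocksFor returns 32 blocks, whereas nThreads/BlockSize truncates to 31 and leaves the last 160 elements unprocessed. The bounds check is what makes the rounded-up grid safe.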