Low performance. What's wrong?

kbam · May 5, 2009, 2:38am

Hi
I’m a newbie to CUDA, but I like it :D

(I’m running XP on a workstation with a Quadro FX 4600, driver version 6.14.11.8120.)
The nBody example runs at about 94 GFlops (as supplied, not rebuilt).

I need to run ~15 million threads about 16,000 times for an application. 15 million is too many for the GPU, so I’ve split them into groups I call ‘Cohorts’.
I adapted myFirstKernel.cu to the application, but performance was way low, so I cut it right down to just about nothing (see below). Performance is still very low.
I’m building the code in Visual Studio in Release mode, and it takes about 9 seconds to run the iterations with the settings #define’d below, regardless of whether I run the .exe from a .bat or from within Visual Studio (NB the time is for the inner loop that calls the kernel only).
Assuming that my kernel is about 5 flop per thread, this cut-down code is running at about 0.15 GFlops! :(
I have changed the BlockSize and the number of threads I’m running, but the speed stays about the same.
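
The 0.15 GFlops figure follows from the iteration count, thread count and timing quoted in this post. A minimal sketch of that arithmetic, taking the 5 flop per thread at face value; the helper name `estimateGflops` is made up for illustration and is host-side code only:

```cuda
#include <cassert>

// Hypothetical helper: sustained GFlop/s for `iterations` kernel launches of
// `threads` threads each, with `flopPerThread` operations per thread,
// completed in `seconds` of wall-clock time.
double estimateGflops(double iterations, double threads,
                      double flopPerThread, double seconds)
{
    assert(seconds > 0.0);
    return iterations * threads * flopPerThread / seconds / 1e9;
}
```

With the posted settings, `estimateGflops(4000 * 2, 8096 * 4, 5, 9.0)` comes out at roughly 0.14, which matches the estimate above and is far below the 94 GFlops of the nBody sample.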

Is there something obvious I’m doing wrong?
Thanks

== Cut down code ==
#define Iterations 4000*2
#define nThreads 8096*4
#define BlockSize 256
#define Cohorts 1

__global__ void model( int offset, float* input, float* result)
{
    unsigned int loc = offset + blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[loc];
    result[loc] = x * 0.1231f;
}

int main( int argc, char** argv)
{
    // usual code for setting up arrays on host and device and copying data to GPU //

    int numBlocks = nThreads/BlockSize;
    int numThreadsPerBlock = BlockSize;
    int blocksPerCohort = numBlocks/Cohorts;

    dim3 dimGrid( blocksPerCohort, numThreadsPerBlock );
    dim3 dimBlock( numThreadsPerBlock );

    int cohort = 0;

    // recording start time here
    for ( int its = 0; its < Iterations; its++)
    {
        // this simple model has 3 params: offset, input, result
        model<<< dimGrid , dimBlock >>>( cohort*blocksPerCohort, d_input, d_result);
        checkCUDAError("kernel execution");
    }
    // recording finish time here

    // usual code for copying results back, testing results, freeing device and host memory, exit gracefully //
}
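
A note on the "recording start time" and "recording finish time" comments: kernel launches return to the host before the kernel has finished, so a host timer around the loop only measures GPU work if something waits for the device before the finish time is read. Whether `checkCUDAError` synchronizes is not shown in this post. One common alternative is to time with CUDA events. A minimal sketch, where `timeLaunches` and its `enqueue` callback are hypothetical names and not part of the original code:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: time `launches` calls of `enqueue` using CUDA events.
// Returns the first error encountered; on success, *ms holds the elapsed
// time in milliseconds for all work queued between the two events.
cudaError_t timeLaunches(void (*enqueue)(), int launches, float* ms)
{
    cudaEvent_t start, stop;
    cudaError_t err = cudaEventCreate(&start);
    if (err != cudaSuccess) return err;
    err = cudaEventCreate(&stop);
    if (err != cudaSuccess) { cudaEventDestroy(start); return err; }

    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i)
        enqueue();                      // e.g. one kernel launch per call
    cudaEventRecord(stop, 0);

    err = cudaEventSynchronize(stop);   // wait until the queued work is done
    if (err == cudaSuccess)
        err = cudaEventElapsedTime(ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err;
}
```

Here `enqueue` would wrap the `model<<< dimGrid , dimBlock >>>(...)` call from the loop above, and `launches` would be `Iterations`.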

eyalhir74 · May 5, 2009, 5:58am

== Cut down code ==

#define Iterations 4000*2
#define nThreads 8096*4
#define BlockSize 256
#define Cohorts 1

__global__ void model( int offset, float* input, float* result)
{
    unsigned int loc = offset + blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[loc];
    result[loc] = x * 0.1231f;
}

int main( int argc, char** argv)
{
    // usual code for setting up arrays on host and device and copying data to GPU //

    int numBlocks = nThreads/BlockSize;
    int numThreadsPerBlock = BlockSize;
    int blocksPerCohort = numBlocks/Cohorts;

    dim3 dimGrid( blocksPerCohort, numThreadsPerBlock );
    dim3 dimBlock( numThreadsPerBlock );

    int cohort = 0;

    // recording start time here
    for ( int its = 0; its < Iterations; its++)
    {
        // this simple model has 3 params: offset, input, result
        model<<< dimGrid , dimBlock >>>( cohort*blocksPerCohort, d_input, d_result);
        checkCUDAError("kernel execution");
    }
    // recording finish time here

    // usual code for copying results back, testing results, freeing device and host memory, exit gracefully //
}

Can you post the command line you use to compile the code? Are you sure you’re not running in emulation mode?

eyal

navier-stokes · May 5, 2009, 6:54am

Hi!

I can’t figure out the sense of

dim3 dimGrid( blocksPerCohort, numThreadsPerBlock );
dim3 dimBlock( numThreadsPerBlock );

The second argument to dimGrid is the grid’s y dimension, i.e. more blocks, not a thread count.

Actually #threads == #blocks * #threadsPerBlock, but in your code you’re launching

(#blocksPerCohort * #numThreadsPerBlock) * #numThreadsPerBlock threads.
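
To put numbers on this, the thread count of a launch is the product of all grid dimensions times the product of all block dimensions. A small host-side sketch; `totalThreads` is a hypothetical helper, and the value 126 assumes the posted settings, where `nThreads/BlockSize` is (8096*4)/256 truncated to an integer:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: total number of threads a <<<grid, block>>> launch
// creates. Every block in the grid runs a full block of threads.
unsigned long long totalThreads(dim3 grid, dim3 block)
{
    unsigned long long blocks   = 1ULL * grid.x * grid.y * grid.z;
    unsigned long long perBlock = 1ULL * block.x * block.y * block.z;
    return blocks * perBlock;
}
```

`totalThreads(dim3(126, 256), dim3(256))` is 8,257,536, while `totalThreads(dim3(126), dim3(256))` is 32,256. The posted launch therefore runs 256 times as many threads as intended, and because the kernel only reads `blockIdx.x`, the extra blocks simply repeat the same elements.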

kbam · May 6, 2009, 12:25am

Thank you both.

I misunderstood the second parameter to dimGrid and thought it had to be the number of threads per block.

After changing my code to the following, I can now run 10,000 iterations of 4 million threads in under 7 seconds :)

dim3 dimGrid( blocksPerCohort );
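
For anyone reusing this pattern, two details of the posted code may be worth guarding against. `nThreads/BlockSize` truncates, so with 8096*4 elements and 256 threads per block it gives 126 blocks and leaves the last 128 elements untouched, and the kernel has no bounds check. A minimal sketch of a 1D launch that rounds the grid up and guards the index, assuming the element count is passed in; `modelGuarded`, `blocksFor` and `launchModel` are illustrative names, not from the original code:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Same arithmetic as the kernel in this thread, plus a bounds guard so the
// grid can be rounded up past the end of the arrays. `n` is the total number
// of elements in input/result; `offset` is an element index.
__global__ void modelGuarded(int offset, int n, const float* input, float* result)
{
    unsigned int loc = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (loc < (unsigned int)n)
        result[loc] = input[loc] * 0.1231f;
}

// Hypothetical helper: number of blocks needed to cover `count` elements,
// rounding up instead of truncating.
int blocksFor(int count, int blockSize)
{
    assert(count > 0 && blockSize > 0);
    return (count + blockSize - 1) / blockSize;
}

// 1D launch: the block count goes in grid.x only, threads per block in block.x.
cudaError_t launchModel(int offset, int count, int n, int blockSize,
                        const float* d_input, float* d_result)
{
    dim3 dimGrid(blocksFor(count, blockSize));
    dim3 dimBlock(blockSize);
    modelGuarded<<<dimGrid, dimBlock>>>(offset, n, d_input, d_result);
    return cudaGetLastError();   // reports launch-configuration errors
}
```

With the posted sizes, `blocksFor(8096 * 4, 256)` is 127 rather than 126. The thread only exercises cohort 0, so treating `offset` as an element index rather than the block count passed in the original loop is an assumption about the intent.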

Thanks again

PS Compiling with Visual Studio 2008 in Release mode
