Confused about GPU vs CPU speed in multiplication

I keep reading that CUDA is really fast, but I can’t seem to find a simple demonstration of a GPU function having a speed increase over a CPU version.

I made a simple multiplication function for the CPU using VS2008 that does a million multiplications on ints over a thousand trials, and then I made a GPU version that does one trial of a million multiplications. I have a Pentium 4 3.00GHz and a 9800GT. My system has 1GB of RAM and the GPU has 512MB. I seem to get around a microsecond or two with the CPU implementation and anywhere from 45-60 milliseconds with the GPU implementation.

I’ve heard that there’s a lot of overhead in initializing GPU functions; is this what I’m seeing? Am I doing enough calculations to see any difference? Where should I be looking to optimize GPU code in general?

Could someone provide a simple example of a GPU calculating data faster than a CPU?

Anyway, here’s my code:

[codebox]//CPU Implementation
#include <stdio.h>
#include <stdlib.h>
#include <tchar.h>
#include "hr_time.h"

int increProduct( int nIncrementTo )
{
	int a = 0, i;
	for ( i=0; i<nIncrementTo; i++ )
		a = 7*i;
	return a;
}

int _tmain(int argc, _TCHAR* argv[])
{
	const int TRIALS = 1000;
	unsigned int j, a = 0;
	double total = 0;
	stopWatch s;

	startTimer( &s);
	for ( j=0; j<TRIALS; j++)
	{
		a = increProduct(1000000);
	}
	stopTimer(&s);
	total = getElapsedTime(&s);

	printf("1000000 multiplications took on average of %d trials: %f milliseconds\n", TRIALS, total);
	printf("final product: %d\n", a);
	system("PAUSE");
	return 0;
}

//GPU Implementation
#include <stdio.h>
#include <cutil_inline.h>   //CUDA SDK cutil helpers (timers, error checks)

#define IMUL(a, b) __umul24(a, b)

__global__ void
MultiplyGPUTest(int N)
{
	int a;
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	if ( idx < N )
		a = IMUL(7, idx); //the product should be under 24 bits, so use 24-bit mul
}

void
runMultiplyGPUTest( int argc, char** argv)
{
	// use command-line specified CUDA device, otherwise use device with highest Gflops/s
	if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
		cutilDeviceInit(argc, argv);
	else
		cudaSetDevice( cutGetMaxGflopsDeviceId() );

	dim3 dimGrid(32768,1,1);
	dim3 dimBlock(32,1,1);
	unsigned int timer = 0;

	cutilCheckError( cutCreateTimer( &timer));
	cutilCheckError( cutStartTimer( timer));

	int num_elements = 1000000;
	MultiplyGPUTest<<<dimGrid,dimBlock>>>(num_elements);
	cutilCheckMsg("Kernel execution failed");

	cutilCheckError( cutStopTimer( timer));
	printf( "Processing time: %f (ms)\n", cutGetTimerValue( timer));
	cutilCheckError( cutDeleteTimer( timer));
}[/codebox]
multiplyGPU.zip (9.66 KB)
multiplyCPU.zip (3.59 KB)

The amount of work you’re doing on the GPU is nothing. The time you’re spending is the time it takes to read the data and write it back.
You’re also running only 1 thread block of 32 threads. Your card most likely has upwards of 100 processors, most of which you are not currently using.

–edit: forget the above, I wasn’t looking at the right “main”.
Your kernel, as written, should take no time at all, since it’s not doing anything.
It is never writing the results back to memory, so it will be optimized away.

I don’t have VS9 or time to do a new project (I’m that lazy), but microseconds seems a bit fast… maybe it’s optimizing something away.
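
As a sketch only (not the original poster’s code), here is the minimal change the above implies: give the kernel an output array and store each product in it, so the compiler cannot discard the work. The kernel name and the `out` parameter are made up for illustration.

[codebox]// Hypothetical sketch: same multiply, but the result is stored to global memory.
__global__ void MultiplyGPUTestStored(int* out, int N)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < N)
		out[idx] = 7 * idx;   // the global-memory write keeps the computation from being optimized away
}[/codebox]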

Ailleur is correct: because your kernel has no outputs to global memory, it’s all being optimized away to nothing, so in theory your kernel should return instantly.

Additionally, you’re executing your kernel in blocks of 32x1 threads; you’d get better results using larger blocks (192x1 or 256x1), that is, once you start writing your results back to gmem.

Also, integer math isn’t a GPU’s specialty. While that sample will probably still be faster than most CPUs, a similar test with, say, floating point additions/multiplies would show an even greater speed increase over a CPU, not to mention more complicated kernels that use textures/constant memory/shared memory for storing data structures, or, even more extreme (but less useful), using texture interpolation for even greater speed in some algorithms.
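
For reference, a rough sketch of the launch arithmetic the 192/256-thread suggestion implies, reusing the hypothetical kernel sketched above and assuming the original 1,000,000-element count:

[codebox]// Hypothetical launch-configuration sketch for 256-thread blocks.
int num_elements = 1000000;
dim3 dimBlock(256, 1, 1);                                     // 256 threads per block
dim3 dimGrid((num_elements + dimBlock.x - 1) / dimBlock.x);   // round up: 3907 blocks
MultiplyGPUTestStored<<<dimGrid, dimBlock>>>(d_out, num_elements); // d_out: an int array assumed allocated with cudaMalloc[/codebox]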

Hey, so I tried out some of Smokey and Ailleur’s suggestions: moving memory to and from the GPU so the kernel actually has some throughput, using float arrays, and changing the number of threads per block. Doing so really trimmed down the GPU code execution time. I’ll post the code when I get access to my terminal again, but basically I changed the ints to float arrays, I pushed float arrays to the GPU and back, and I changed the block dim to (64,4) and the grid dim to (4096,1). (64*4*4096 = 1048576 multiplications)

I can only guess that there was some sort of anomaly occurring before, possibly thrashing, because now I’m getting an execution time of about 1.5ms, whereas before, for some seemingly simple code that might have just been optimized away, I was getting an execution time of around 40ms. It probably was conflicting memory accesses though, since I passed the kernel function a single int value and it was to return as the same value.

I also changed my CPU code to make sure that my timer worked and that my actual timed code wasn’t being optimized out, and it came out to around 3ms in execution time.

I’m still wondering what else I could try to trim down run-time execution. From what I understand, pass-by-value parameters for global functions automatically get allocated in shared memory, but I’m not sure if that’s true. If it’s not true, would loading my arrays into shared memory and then back to global after the operations were complete actually run faster? Also, would I need to use some sort of reduction algorithm to fully benefit from that?

Thanks for all your responses!

Pass-by-value parameters do end up in shared memory (transparently for you).
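
To illustrate the shared-memory question, here is a rough sketch (mine, not from this thread) of what explicit shared-memory staging looks like. For a kernel that touches each element exactly once there is nothing to gain from it; shared memory only pays off when several threads in a block reuse the same data.

[codebox]// Hypothetical sketch: staging through shared memory (assumes blockDim.x == 256).
__global__ void multiplyViaShared(const float* in, float* out, int N)
{
	__shared__ float tile[256];                      // one element per thread in the block
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	if (idx < N)
		tile[threadIdx.x] = in[idx];                 // global -> shared
	__syncthreads();

	if (idx < N)
		out[idx] = tile[threadIdx.x] * tile[threadIdx.x];  // compute from shared, write to global
}[/codebox]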

I think you are making a very apples-to-oranges comparison. You’ve got your CPU code constantly multiplying into the same variable => benchmarking raw CPU arithmetic speed. On the GPU, with only one multiply per thread, you are going to be memory bandwidth bound => you are benchmarking the memory bandwidth of the GPU. To make it a fair comparison of memory bandwidth-ish benchmarks, you should allocate a large array (your million entries) and multiply each value of the array, storing the result back to the array. This should show some nice speedups vs the CPU (10-20x).
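
A minimal sketch of the array-based CPU loop this suggests (my own, assuming a million-entry float array) might look like:

[codebox]// Hypothetical sketch of the suggested fair CPU-side benchmark:
// multiply every element of a large array and store the result back.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const int N = 1000000;
	float* data = (float*)malloc(N * sizeof(float));
	int i;

	for (i = 0; i < N; i++)
		data[i] = (float)i;

	for (i = 0; i < N; i++)          // time this loop against the GPU kernel
		data[i] = 7.0f * data[i];

	printf("data[50] = %f\n", data[50]);  // use the result so the loop is not optimized away
	free(data);
	return 0;
}[/codebox]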

If you are really keen on benchmarking FLOPs alone check out the code Simon Green posted here: http://forums.nvidia.com/index.php?showtop…752&hl=FLOP

I’d check to see if your CPU code is being optimised away.

int increProduct( int nIncrementTo )
{
  int a,i;
  for ( i=0; i<nIncrementTo; i++ )
	a = 7*i;
  return a;
}

Is the same as:

int increProduct( int nIncrementTo )
{
  int a;
  a = 7*(nIncrementTo-1);
  return a;
}

And also:

for ( j=0; j<TRIALS; j++)
{
	a = increProduct(1000000);
}

This doesn’t do anything different if run many times. I don’t know if it’s spotted by the compiler, but it’s certainly worth checking.
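
One way to check (my suggestion, not from the original posts) is to make each trial feed into something the compiler has to keep, for example by varying the argument and accumulating the results:

[codebox]// Hypothetical check: accumulate each trial's result so the timed loop cannot be removed.
unsigned int sum = 0;
for ( j = 0; j < TRIALS; j++ )
{
	sum += increProduct(1000000 + (int)j);  // vary the argument so the calls are not identical
}
printf("checksum: %u\n", sum);              // printing the sum keeps the work observable[/codebox]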

Thanks for your insight. I’ve taken your post into consideration, as well as other things posted in this topic, and updated my CPU and GPU code.

If anyone has suggestions, please don’t hesitate to post.

CPU Code

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <tchar.h>
#include "hr_time.h"

int increProduct( int * pRandom, int pIndex, int f1Index, int f2Index)
{
	*(pRandom+pIndex) = *(pRandom+f1Index) * (*(pRandom+f2Index));
	return *(pRandom+pIndex);
}

int _tmain(int argc, _TCHAR* argv[])
{
	const int TRIALS = 1000, NMULTS = 1048576;
	unsigned int j, i, *a;
	int * pRandom = NULL;
	int seed = 10000;
	double total = 0;
	stopWatch s;

	a = new unsigned int[NMULTS];
	pRandom = new int[NMULTS];

	srand(seed);
	for ( i=0; i<NMULTS; i++ )
		pRandom[i] = rand();                                   //fill the array with random factors

	printf( "CPU Multiply Test\n");
	startTimer( &s);
	for ( j=0; j<TRIALS; j++)
	{
		for ( i=0; i<NMULTS; i++)
			a[i] = increProduct(pRandom, i, i, (NMULTS-1-i));  //multiply element i by its mirror element
	}
	stopTimer(&s);
	total = getElapsedTime(&s);

	printf("%d multiplications took on average of %d trials: %f seconds\n", NMULTS, TRIALS, total/1000);
	printf("product at index %d: %d\n", TRIALS-1, a[TRIALS-1]);
	system("PAUSE");

	delete [] pRandom;
	delete [] a;
	return 0;
}[/codebox]

GPU Code

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cutil_inline.h>   //CUDA SDK cutil helpers (timers, error checks)

__global__ void
MultiplyGPUTest(float* a, float* b, int N)
{
	unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x; //Get our unique id for indexing.
	if ( idx < N ){                      //If the current id is within our number of multiplications,
		float fac = *(a+idx);            //read the input value (floats yield the best math performance),
		*(b+idx) = fac*fac;              //square it and place it in the output array.
	}
}

void
runMultiplyGPUTest( int argc, char** argv)
{
	printf( "GPU Multiply Test\n");

	// use command-line specified CUDA device, otherwise use device with highest Gflops/s
	if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
		cutilDeviceInit(argc, argv);
	else
		cudaSetDevice( cutGetMaxGflopsDeviceId() );

	dim3 dimGrid(4096,1,1);                 //Blocks in the grid; max x,y,z dimensions are 65535,65535,65535.
	dim3 dimBlock(256,1,1);                 //Threads per block; x*y*z must be <=512, max dimensions are 512,512,64.
	unsigned int timer = 0;                 //CUDA timer
	float *h_a, *d_a, *d_b;                 //Host and device arrays
	int i, outputIndex = 50, nTrials = 100000;  //test variables
	unsigned int j;
	unsigned int num_elements = 1048576;    //number of multiplications, 1 multiplication per thread

	h_a = (float *)malloc(num_elements*sizeof(float));   //allocate array memory on the host for our values
	for ( j=0; j<num_elements; j++)                      //initialize our host array
		*(h_a+j) = (float)j;

	cutilSafeCall( cudaMalloc((void **)&d_a, num_elements*sizeof(float)) ); //device memory for the input values
	cutilSafeCall( cudaMalloc((void **)&d_b, num_elements*sizeof(float)) ); //device memory for the output values
	cutilSafeCall( cudaMemcpy(d_a, h_a, num_elements*sizeof(float), cudaMemcpyHostToDevice) ); //host -> device

	//warm-up call; for accurate timing we need an initial kernel call.
	MultiplyGPUTest<<<dimGrid,dimBlock>>>(d_a,d_b,num_elements);    //kernel call, Foo<<<blocks,threads>>>(params)
	cutilCheckMsg("Kernel execution failed");                       //kernel call check
	cutilSafeCall( cudaThreadSynchronize() );                       //make sure the kernel is done before continuing

	cutilCheckError( cutCreateTimer( &timer));                      //initialize our timer
	cutilCheckError( cutStartTimer( timer));                        //start the timer
	for ( i=0; i<nTrials; i++ ) {                                   //call the kernel nTrials times
		MultiplyGPUTest<<<dimGrid,dimBlock>>>(d_a,d_b,num_elements);
		cutilCheckMsg("Kernel execution failed");
		cutilSafeCall( cudaThreadSynchronize() );
	}
	cutilCheckError( cutStopTimer( timer));                         //stop the timer

	cutilSafeCall( cudaMemcpy(h_a, d_b, num_elements*sizeof(float), cudaMemcpyDeviceToHost) ); //device -> host

	printf( "Average processing time for %d trials: %f (ms)\nValue at index %d : %f\n",
			nTrials, cutGetTimerValue(timer)/nTrials, outputIndex, *(h_a+outputIndex));
	cutilCheckError( cutDeleteTimer( timer));

	free( h_a);                             //host memory deallocation
	cutilSafeCall(cudaFree(d_a));           //device memory deallocation
	cutilSafeCall(cudaFree(d_b));           //device memory deallocation
}

[/codebox]
MultiplyGPU_2_13_2009.zip (8.68 KB)
MultiplyCPU_2_13_2009.zip (4.46 KB)

I am new to GPUs, and I hope I can use the GPU’s parallel capability for an application of mine.
Part of my future application is to compute percentile values (say 5%, 50%, and 95%, possibly all at the same time). You can also think of it as a sorting problem.
For example, data of 25000 x 1 (one column) is easy for both the CPU and the GPU,
but 25000 x 1000 (or 25000 x 2000) is not quick on the CPU.

My question is: can someone give me an example demonstrating the GPU’s ability to calculate 1000 (or 2000) such columns?
Does it make sense to use the GPU for this?
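
Not a full answer, but as a rough illustration of what a sort-based percentile on the GPU could look like: the Thrust library provides thrust::device_vector and thrust::sort, and a minimal single-column sketch (my own, with made-up data) would be something like the code below. For 1000-2000 columns you would want to batch the sorts rather than launch them one at a time, otherwise the GPU will sit mostly idle.

[codebox]// Hypothetical sketch: percentiles of one 25000-element column via a GPU sort (Thrust).
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdio>
#include <cstdlib>

int main(void)
{
	const int N = 25000;
	thrust::host_vector<float> h_col(N);
	for (int i = 0; i < N; i++)
		h_col[i] = (float)rand() / RAND_MAX;     // fill one column with sample data

	thrust::device_vector<float> d_col = h_col;  // copy the column to the GPU
	thrust::sort(d_col.begin(), d_col.end());    // sort it on the GPU

	// Percentiles are reads at the corresponding ranks of the sorted column.
	printf("5%%: %f  50%%: %f  95%%: %f\n",
	       (float)d_col[N/20], (float)d_col[N/2], (float)d_col[(N*19)/20]);
	return 0;
}[/codebox]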