Performance Boost Not Really Seen

Hey,
I’m new to CUDA and have been really excited about it, so I bought the book ‘CUDA by Example’ and got all set up.
I got to about the end of chapter 3, when I decided I wanted to perform a basic test of the CPU against the GPU with the following code:

#include <stdio.h>
#include <time.h>
#include <windows.h>

#define N 65530

__global__ void add(int* a, int* b, int* c)
{
    int tid = blockIdx.x;
    if(tid < N)
    {
        for(int i=0; i<65000; i++)
        {
            c[tid] = a[tid] + b[tid];
        }
    }
}

void cpuAdd(int* a, int* b, int* c)
{
    for(int s=0; s<N; s++)
    {
        for(int i=0; i<65000; i++)
        {
            c[s] = a[s] + b[s];
        }
    }
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void**) &dev_a, N * sizeof(int));
    cudaMalloc((void**) &dev_b, N * sizeof(int));
    cudaMalloc((void**) &dev_c, N * sizeof(int));

    //fill the arrays a and b on the cpu
    for(int i=0; i<N; i++)
    {
        a[i] = -i;
        b[i] = i * i;
    }

    //copy the arrays a and b to the device
    DWORD start = GetTickCount();
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    //copy the results back to the host
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    DWORD finish = GetTickCount();
    DWORD timeTaken = finish - start;

    //display the results
    printf("it took %lu ms to get the results from the GPU\n", timeTaken);

    //now time the cpu
    start = GetTickCount();

    cpuAdd(a, b, c);

    finish = GetTickCount();
    timeTaken = finish - start;

    //display the results
    printf("it took %lu ms to get the results from the CPU\n", timeTaken);

    //free the memory on the device
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    getchar();
    return 0;
}

Unfortunately the results are not as good as I had hoped. Without the ‘for(int i=0; i<65000; i++)’ loop in the add functions, the CPU beats the GPU by a wide margin. With it, the GPU only just beats the CPU.

These are the results with the for loop:
[timing screenshot]

These are the results without the for loop:
[timing screenshot]

I realise that creating the threads and transferring the data to/from the device is costly, but I would have thought that by increasing the amount of work done per element, the GPU would do a lot better than it does against the CPU. The results shown are from my laptop, but I get the same results relative to each other on my tower with an NVIDIA GTX 260 card, just with shorter times. I realise this code is probably really inefficient, using one thread per block etc., but that’s how far I am at the moment.

Am I doing something that gives the CPU an unfair advantage? I.e. am I comparing apples with oranges?

Best Wishes,
Stu

The most immediate issue is that you are running with one thread per block. CUDA devices are not multicore processors, but are rather “vector machines” wrapped up in an interesting software interface. The hardware is designed to execute groups of 32 threads (called “warps”) at a time. A warp can only contain threads from the same block, so when you run a block with only 1 thread, the block has 1 warp with 31 empty slots. The warp takes pretty much the same time to execute no matter what, so this code has handicapped the GPU by a factor of 32.
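For example, with 128 threads per block (any multiple of 32 keeps whole warps full; the exact figure is a tuning choice), the kernel computes a global index from both the block and the thread, and the launch needs far fewer blocks to cover N elements. A rough sketch:

// Sketch: the same vector add, but with many threads per block.
__global__ void add(int* a, int* b, int* c)
{
    // global element index = block offset + thread index within the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if(tid < N)
        c[tid] = a[tid] + b[tid];
}

// Launch: enough 128-thread blocks to cover all N elements.
add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c);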

Ah thank you so much! I take it that it was able to go slightly faster because it can run more than 1 warp at a time? That answers my question perfectly. I will try again.

I’m surprised that the loop ‘for(int i=0; i<65000; i++)’ wasn’t thrown away by the compiler. Do you have optimization turned off?
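If you want that busy-work loop to survive an optimizing compiler, one option (just a sketch) is to make each iteration depend on the loop counter, so the compiler cannot collapse 65000 identical stores into one:

// Sketch: accumulate something that changes every iteration,
// then store the result once at the end.
int sum = 0;
for(int i = 0; i < 65000; i++)
{
    sum += a[tid] + b[tid] + i;
}
c[tid] = sum;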

I haven’t manually set any optimization options…

I changed the code to use multiple threads per block and now I get results like those shown in this graph:

[timing graph]

(x axis: work factor, i.e. the number of times the two vectors of 65536 ints are added together; y axis: time taken in ms)

That looks sort of correct. Unfortunately, when I moved the exe from my laptop to my desktop tower with the dedicated graphics card (from the 32-bit machine to the 64-bit one), I keep getting 0 and 15 for the times instead of a range from 200 - 600 on the graphics card, which suggests to me that something has gone wrong.

Surely the exe should run just the same on both machines, right?

I mean, a 32-bit exe will run correctly on a 64-bit machine, right?

Never mind, the previous post is pointless. I discovered that I’m getting error code 11 once and then error code 30s on my CUDA memory allocations. I will have to fix that first.
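DealWithResult in the code below is the error-checking wrapper I put around the CUDA calls; a minimal sketch of what such a wrapper does (my exact version may differ slightly) is:

// Minimal sketch of a CUDA error-checking wrapper (needs <stdio.h> and <stdlib.h>):
// if the call failed, print the readable error string and stop.
void DealWithResult(cudaError_t result)
{
    if(result != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(result));
        exit(1);
    }
}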

int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c, *dev_n;

DealWithResult( cudaMalloc((void**) &dev_a, N * sizeof(int)) ); //these are fine
DealWithResult( cudaMalloc((void**) &dev_b, N * sizeof(int)) ); //these are fine
DealWithResult( cudaMalloc((void**) &dev_c, N * sizeof(int)) ); //these are fine
DealWithResult( cudaMalloc((void**) &dev_n, sizeof(int)) );     //these are fine

//------------------------Graphics card test------------------
for(n = 0; n <= 2; n++)
{
    start = GetTickCount();

    for(int s = 0; s < 100; s++)
    {
        DealWithResult( cudaMemcpy(dev_n, &n, sizeof(int), cudaMemcpyHostToDevice) ); //can't do this more than once

        //number of blocks, then threads per block
        add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c, dev_n);

        //copy the results back to the host
        DealWithResult( cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost) ); //can't do this more than once

        DealWithResult( cudaFree(dev_n) ); //can't do this more than once
    }

    finish = GetTickCount();
    timeTaken = finish - start;
    printf("%i, GPU, %d\n", n, timeTaken);
}

//display the results

//free the memory from the device
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);

DealWithResult( cudaFree(dev_n) ); //can't do this more than once

That cudaFree( dev_n ) shouldn’t be in the loop!
You’re freeing the memory that you want to use at the top of the loop.

But I’m resetting dev_n at the top of the loop:

DealWithResult(cudaMemcpy(dev_n, &n, sizeof(int), cudaMemcpyHostToDevice)); //can't do this more than once

I’m sorry, I don’t quite know how it works yet; I’m looking into it in my free time.

I’m adjusting n and I want it to be changed on the GPU as well, so I’m constantly freeing the memory and then memcopying it again. This is no doubt wrong for some reason.

I don’t understand what you think your code is doing, but you do have at least one fundamental misunderstanding.

What you have to do is:

1. Allocate memory on the device with cudaMalloc.

2. Do things to that memory (perhaps in a loop), such as:
  • memcopy to it
  • perform mathematical operations on it in a kernel
  • memcopy from it
  etc.

3. When you are finished, de-allocate that memory, i.e. cudaFree it. At that point it is no longer available for use.

cudaMemcpy(dev_n, &n, sizeof(int), cudaMemcpyHostToDevice) does not “re-set” the memory. It copies values from host memory to device memory. That memory has to have been previously allocated, and once you cudaFree it you cannot copy to it again without allocating it again.
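Concretely, for your dev_n the structure would be roughly this (a sketch reusing your variable names and launch configuration):

// Allocate once, up front.
DealWithResult( cudaMalloc((void**) &dev_n, sizeof(int)) );

for(n = 0; n <= 2; n++)
{
    for(int s = 0; s < 100; s++)
    {
        // Overwrite the device copy of n each iteration; no cudaFree/cudaMalloc needed here.
        DealWithResult( cudaMemcpy(dev_n, &n, sizeof(int), cudaMemcpyHostToDevice) );

        add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c, dev_n);

        DealWithResult( cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost) );
    }
}

// Free once, when you are completely finished with dev_n.
DealWithResult( cudaFree(dev_n) );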