Performance Boost Not Really Seen

Hey,
I’m new to CUDA and have been really excited about it, so I bought the book ‘CUDA by Example’ and got all set up.
I got to about the end of chapter 3, when I decided I wanted to perform a basic test of the CPU against the GPU with the following code:

#include <stdio.h>
#include <time.h>
#include <windows.h>

#define N 65530

__global__ void add(int* a, int* b, int* c)
{
    int tid = blockIdx.x;
    if(tid < N)
    {
        for(int i=0; i<65000; i++)
        {
            c[tid] = a[tid] + b[tid];
        }
    }
}

void cpuAdd(int* a, int* b, int* c)
{
    for(int s=0; s<N; s++)
    {
        for(int i=0; i<65000; i++)
        {
            c[s] = a[s] + b[s];
        }
    }
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void**) &dev_a, N * sizeof(int));
    cudaMalloc((void**) &dev_b, N * sizeof(int));
    cudaMalloc((void**) &dev_c, N * sizeof(int));

    //fill the arrays a and b on the cpu
    for(int i=0; i<N; i++)
    {
        a[i] = -i;
        b[i] = i * i;
    }

    //copy the arrays a and b to the device
    DWORD start = GetTickCount();
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    //copy the results back to the host
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    DWORD finish = GetTickCount();
    DWORD timeTaken = finish - start;

    //display the results
    printf("it took %lu ms to get the results from the GPU\n", timeTaken);

    //now time the cpu
    start = GetTickCount();

    cpuAdd(a, b, c);

    finish = GetTickCount();
    timeTaken = finish - start;

    //display the results
    printf("it took %lu ms to get the results from the CPU\n", timeTaken);

    //free the memory on the device
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    getchar();
    return 0;
}

Unfortunately the results are not as good as I had hoped. Without the ‘for(int i=0; i<65000; i++)’ loop in the add functions, the CPU beats the GPU by a wide margin. With it, the GPU only just beats the CPU.

These are the results with the for loop:
[timing screenshot]

These are the results without the for loop:
[timing screenshot]

I realise that creating the threads and transferring the data to/from the device is costly, but I would have thought that by increasing the amount of work done per element, the GPU would do a lot better than it does against the CPU. The results shown are from my laptop, but I get the same results relative to each other on my tower with an NVIDIA GTX 260 card, just with shorter times. I realise this code is probably really inefficient, using one thread per block etc., but that’s how far I am at the moment.

Am I doing something that gives the CPU an unfair advantage? I.e. am I comparing apples with oranges?

Best Wishes,
Stu

The most immediate issue is that you are running with one thread per block. CUDA devices are not multicore processors, but are rather “vector machines” wrapped up in an interesting software interface. The hardware is designed to execute groups of 32 threads (called “warps”) at a time. A warp can only contain threads from the same block, so when you run a block with only 1 thread, the block has 1 warp with 31 empty slots. The warp takes pretty much the same time to execute no matter what, so this code has handicapped the GPU by a factor of 32.
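For example, with 128 threads per block (any multiple of 32 keeps whole warps full; the exact figure is a tuning choice), the kernel computes a global index from both the block and the thread, and the launch needs far fewer blocks to cover N elements. A rough sketch:

// Sketch: the same vector add, but with many threads per block.
__global__ void add(int* a, int* b, int* c)
{
    // global element index = block offset + thread index within the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if(tid < N)
        c[tid] = a[tid] + b[tid];
}

// Launch: enough 128-thread blocks to cover all N elements.
add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c);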

Ah thank you so much! I take it that it was able to go slightly faster because it can run more than 1 warp at a time? That answers my question perfectly. I will try again.

I’m surprised that the loop ‘for(int i=0; i<65000; i++)’ wasn’t thrown away by the compiler. Do you have optimization turned off?
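If you want that busy-work loop to survive an optimizing compiler, one option (just a sketch) is to make each iteration depend on the loop counter, so the compiler cannot collapse 65000 identical stores into one:

// Sketch: accumulate something that changes every iteration,
// then store the result once at the end.
int sum = 0;
for(int i = 0; i < 65000; i++)
{
    sum += a[tid] + b[tid] + i;
}
c[tid] = sum;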

I haven’t manually set any optimization options…

I changed the code to use multiple threads per block and now I get results like those shown in this graph:

[timing graph]

(x axis: work factor, i.e. the number of times the two vectors of 65536 ints are added together; y axis: time taken in ms)

That looks sort of correct. Unfortunately, when I moved the exe from my laptop to my desktop tower with the dedicated graphics card (from the 32-bit machine to the 64-bit one), I keep getting 0 and 15 for the times instead of a range from 200 - 600 on the graphics card, which suggests to me that something has gone wrong.

Surely the exe should run just the same on both machines, right?

I mean, a 32-bit exe will run correctly on a 64-bit machine, right?

Never mind, the previous post is pointless. I discovered that I’m getting error code 11 once and then error code 30s on my CUDA memory allocations. I will have to fix that first.
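DealWithResult in the code below is the error-checking wrapper I put around the CUDA calls; a minimal sketch of what such a wrapper does (my exact version may differ slightly) is:

// Minimal sketch of a CUDA error-checking wrapper (needs <stdio.h> and <stdlib.h>):
// if the call failed, print the readable error string and stop.
void DealWithResult(cudaError_t result)
{
    if(result != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(result));
        exit(1);
    }
}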

int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c, *dev_n;

DealWithResult( cudaMalloc((void**) &dev_a, N * sizeof(int)) ); //these are fine
DealWithResult( cudaMalloc((void**) &dev_b, N * sizeof(int)) ); //these are fine
DealWithResult( cudaMalloc((void**) &dev_c, N * sizeof(int)) ); //these are fine
DealWithResult( cudaMalloc((void**) &dev_n, sizeof(int)) );     //these are fine

//------------------------Graphics card test------------------
for(n = 0; n <= 2; n++)
{
    start = GetTickCount();

    for(int s = 0; s < 100; s++)
    {
        DealWithResult( cudaMemcpy(dev_n, &n, sizeof(int), cudaMemcpyHostToDevice) ); //can't do this more than once

        //number of blocks, then threads per block
        add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c, dev_n);

        //copy the results back to the host
        DealWithResult( cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost) ); //can't do this more than once

        DealWithResult( cudaFree(dev_n) ); //can't do this more than once
    }

    finish = GetTickCount();
    timeTaken = finish - start;
    printf("%i, GPU, %d\n", n, timeTaken);
}

//display the results

//free the memory from the device
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);

DealWithResult( cudaFree(dev_n) ); //can't do this more than once

That cudaFree( dev_n ) shouldn’t be in the loop!
You’re freeing the memory that you want to use at the top of the loop.

But I’m resetting dev_n at the top of the loop:

DealWithResult(cudaMemcpy(dev_n, &n, sizeof(int), cudaMemcpyHostToDevice)); //can't do this more than once

I’m sorry, I don’t quite know how it works yet; I’m looking into it in my free time.

I’m adjusting n and I want it to be changed on the GPU as well, so I’m constantly freeing the memory and then memcopying it again. This is no doubt wrong for some reason.

I don’t understand what you think your code is doing, but you do have at least one fundamental misunderstanding.

What you have to do is:

1. Allocate memory on the device with cudaMalloc.

2. Do things to that memory (perhaps in a loop), such as:
  • memcopy to it
  • perform mathematical operations on it in a kernel
  • memcopy from it
  etc.

3. When you are finished, de-allocate that memory, i.e. cudaFree it. At that point it is no longer available for use.

cudaMemcpy(dev_n, &n, sizeof(int), cudaMemcpyHostToDevice) does not “re-set” the memory. It copies values from host memory to device memory. That memory has to have been previously allocated, and once you cudaFree it you cannot copy to it again without allocating it again.
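Concretely, for your dev_n the structure would be roughly this (a sketch reusing your variable names and launch configuration):

// Allocate once, up front.
DealWithResult( cudaMalloc((void**) &dev_n, sizeof(int)) );

for(n = 0; n <= 2; n++)
{
    for(int s = 0; s < 100; s++)
    {
        // Overwrite the device copy of n each iteration; no cudaFree/cudaMalloc needed here.
        DealWithResult( cudaMemcpy(dev_n, &n, sizeof(int), cudaMemcpyHostToDevice) );

        add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c, dev_n);

        DealWithResult( cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost) );
    }
}

// Free once, when you are completely finished with dev_n.
DealWithResult( cudaFree(dev_n) );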