Hey,
I’m new to CUDA and have been really excited about it, so I bought the book ‘CUDA by Example’ and got all set up.
I got to about the end of chapter 3, when I decided I wanted to run a basic test of the CPU against the GPU with the following code:
#include <stdio.h>
#include <time.h>
#include <windows.h>

#define N 65530

__global__ void add(int* a, int* b, int* c)
{
    int tid = blockIdx.x;
    if (tid < N)
    {
        for (int i = 0; i < 65000; i++)
        {
            c[tid] = a[tid] + b[tid];
        }
    }
}

void cpuAdd(int* a, int* b, int* c)
{
    for (int s = 0; s < N; s++)
    {
        for (int i = 0; i < 65000; i++)
        {
            c[s] = a[s] + b[s];
        }
    }
}
int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    // fill the arrays a and b on the CPU
    for (int i = 0; i < N; i++)
    {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays a and b to the device and run the kernel
    DWORD start = GetTickCount();
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    // copy the results back to the host (this also waits for the kernel to finish)
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    DWORD finish = GetTickCount();
    DWORD timeTaken = finish - start;

    // display the results
    printf("it took %lu ms to get the results from the GPU\n", timeTaken);

    // now time the CPU
    start = GetTickCount();
    cpuAdd(a, b, c);
    finish = GetTickCount();
    timeTaken = finish - start;

    // display the results
    printf("it took %lu ms to get the results from the CPU\n", timeTaken);

    // free the memory on the device
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    getchar();
    return 0;
}
Unfortunately the results are not as good as I had hoped. Without the ‘for(int i=0; i<65000; i++)’ loop in the add functions, the CPU beats the GPU by a wide margin. With it, the GPU only just beats the CPU.
These are the results with the for loop:
[screenshot of timings]
And these are the results without the for loop:
[screenshot of timings]
I realise that creating the threads and transferring the data to/from the device is costly, but I would have thought that by increasing the amount of work each call does, the GPU would do a lot better against the CPU than it does. The results shown are from my laptop, but I get the same results relative to each other on my tower with an NVIDIA GTX 260 card, just shorter times. I realise this code is probably really inefficient, using one thread per block etc., but that’s as far as I’ve got at the moment.
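For what it’s worth, from what I’ve read I think the next step would be to use many threads per block instead of one, something like this (just a sketch based on the book, I haven’t benchmarked it yet):

__global__ void add(int* a, int* b, int* c)
{
    // global index built from both the block and the thread index
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

// launch: round the grid size up so every element gets a thread
int threadsPerBlock = 256;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);

If I understand it right, the rounding-up means the last block may have spare threads, which is why the kernel still needs the `tid < N` guard.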
Am I doing something that gives the CPU an unfair advantage? I.e. am I comparing apples with oranges?
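One other thing I’m wondering: should I be timing the kernel with CUDA events instead of GetTickCount? Something like this, based on the samples I’ve seen (not sure if it’s the recommended way):

cudaEvent_t startEv, stopEv;
cudaEventCreate(&startEv);
cudaEventCreate(&stopEv);

cudaEventRecord(startEv, 0);
add<<<N, 1>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stopEv, 0);
cudaEventSynchronize(stopEv);        // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, startEv, stopEv);   // elapsed time in milliseconds
printf("kernel alone took %f ms\n", ms);

cudaEventDestroy(startEv);
cudaEventDestroy(stopEv);

That would at least separate the kernel time from the memcpy time, which GetTickCount around the whole thing lumps together.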
Best Wishes,
Stu