Why is this slow?

This is my first CUDA program, created with VS2010. I have an NVIDIA GeForce GT 430 (96 cores, 1 GB RAM). When I run this program, everything executes and I get back the expected results, but it takes a long time – about 5 seconds. I would expect 500,000 additions to take almost no time at all.

Obviously, I am doing something wrong – why is it running slow?

Thank you for your help.

PS. The multi-second delay is on the line: cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);


#define N 500000

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x;
    }
}

int main(void) {
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;

    a = (int*)malloc(N * sizeof(int));
    b = (int*)malloc(N * sizeof(int));
    c = (int*)malloc(N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = i + 1;
        b[i] = (i + 1) * 2;
    }

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1024, 1>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    free(a);
    free(b);
    free(c);
}

You are using blocks with a single thread each. Almost all of your CUDA cores are idle.

Thanks for your reply.

I thought <<<1024, 1>>> asks CUDA to invoke my device function on at most 1024 concurrent blocks. So, this is where I expected the parallelism would come from. Did I understand you correctly that the degree of parallelism is solely controlled by the second parameter (1 in this case)?

The other thing that puzzles me is that I only try to execute 500,000 additions, yet it takes seconds to execute. Even on a single core this should complete in a few milliseconds, yet I have to wait about 5 seconds for cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost). Also, during those 5 seconds the CPU utilization is high, which makes no sense to me, since I would expect the CPU to be idle if all the processing happens on the GPU. Could it be that I am executing on a simulator instead of the device? How do I check that?

Thank you.

You are also doing close to 500,000,000 additions, not 500,000: the loop increment is wrong, so each block redoes (almost) the whole work.

tid += blockDim.x;

should be

tid += gridDim.x;
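With that fix, each block advances by the grid size, so the whole array is covered exactly once. The corrected kernel would look like this (a sketch, keeping your original one-thread-per-block launch):

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;       // one thread per block, so blockIdx.x is the global index
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += gridDim.x;       // stride by the number of blocks in the grid
    }
}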

No, it is controlled by both parameters: there is parallelism at both the thread level and the block level. Check chapter 2 of the CUDA Programming Guide.
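For example, a launch like add<<<(N + 255) / 256, 256>>>(...) creates enough 256-thread blocks to cover all N elements, and each thread derives its global index from both levels (a minimal sketch):

__global__ void add(int *a, int *b, int *c) {
    // block-level and thread-level coordinates combined into one global index
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}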

The CPU is busy-waiting for the results of the kernel call before the cudaMemcpy() can be performed. You can change that behavior with cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync).
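A minimal sketch of where that call belongs (it must run before the first runtime call that creates the CUDA context):

int main(void) {
    // Block the calling CPU thread on synchronization instead of spin-waiting.
    // Must be called before any CUDA call that creates the context.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // ... cudaMalloc, cudaMemcpy, kernel launch, cudaMemcpy as before ...
}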

Thanks for helping me. I made changes to my program, but my serial .NET code still outperforms the GPU code. Below is the GPU code, which executes in about 150 ms inside the IDE and about 30 ms outside of it. I have also included a sample .NET program that applies similar logic (copies two arrays, runs the loop, copies an array back), which executes in about 33 ms inside the IDE and 12 ms outside of it.

Why is the serial .NET code so much faster? Thanks for your help!

GPU code (150ms in IDE and about 30ms outside of IDE):

#define N 500000

#include <stdlib.h>
#include <cuda_runtime.h>
#include <iostream>
#include <ctime>

__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

static void HandleError(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        std::cout << "Error: " << cudaGetErrorString(err)
                  << "\nFile: " << file << "\nLine: " << line;
        std::cin.ignore();
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;

    a = (int*)malloc(N * sizeof(int));
    b = (int*)malloc(N * sizeof(int));
    c = (int*)malloc(N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = i + 1;
        b[i] = (i + 1) * 2;
    }

    unsigned int start = clock();

    HANDLE_ERROR(cudaMalloc((void**)&dev_a, N * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, N * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, N * sizeof(int)));

    HANDLE_ERROR(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));

    add<<<128, 128>>>(dev_a, dev_b, dev_c);

    HANDLE_ERROR(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));

    std::cout << "Time taken in millisecs: " << clock() - start;

    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaFree(dev_b));
    HANDLE_ERROR(cudaFree(dev_c));

    free(a);
    free(b);
    free(c);

    std::cin.ignore();
}

.NET serial code (33 ms in IDE and 12 ms outside of IDE):

static void Main(string[] args)
{
    int N = 500000;

    double[] a = new double[N];
    double[] b = new double[N];

    for (int i = 0; i < N; i++)
    {
        a[i] = i + 1;
        b[i] = (i + 1) * 2;
    }

    Stopwatch w = new Stopwatch();
    w.Start();

    var a1 = a.ToArray();
    var b1 = b.ToArray();
    var c1 = new double[N];

    for (int i = 0; i < N; i++)
    {
        c1[i] = a1[i] + b1[i];
    }

    double[] c = c1.ToArray();

    w.Stop();
    Console.WriteLine("Finished in " + w.ElapsedMilliseconds + "ms.");
    Console.Read();
}

.NET doesn’t need to shovel all the data across PCIe. You need a task that performs more work on the GPU per byte transferred, so that the PCIe transfer cost can be amortized.
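You can confirm this by timing the kernel on its own with CUDA events (a sketch, reusing the dev_a/dev_b/dev_c buffers from the code above):

cudaEvent_t startEv, stopEv;
cudaEventCreate(&startEv);
cudaEventCreate(&stopEv);

cudaEventRecord(startEv, 0);
add<<<128, 128>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stopEv, 0);
cudaEventSynchronize(stopEv);

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, startEv, stopEv);
// kernelMs covers only the kernel; the difference from the total
// is dominated by the three cudaMemcpy transfers over PCIe.

cudaEventDestroy(startEv);
cudaEventDestroy(stopEv);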

Hi,
It is not a fair comparison if you include the cudaMalloc calls in the timing; exclude them. Another thing to keep in mind is that the first cudaMalloc call will also set up the CUDA context.
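A sketch of the timed region with both points applied (cudaFree(0) is just one common idiom to force context creation up front; any runtime call would do):

cudaFree(0);  // force CUDA context creation before any timing

HANDLE_ERROR(cudaMalloc((void**)&dev_a, N * sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_b, N * sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_c, N * sizeof(int)));

unsigned int start = clock();  // time only the transfers and the kernel

HANDLE_ERROR(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));
add<<<128, 128>>>(dev_a, dev_b, dev_c);
HANDLE_ERROR(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));

std::cout << "Time taken in millisecs: " << clock() - start;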