Why is this slow?

This is my first CUDA program, created with VS2010. I have an NVIDIA GeForce GT 430 (96 cores, 1 GB RAM). When I run this program, everything executes and I get back the expected results, but it takes a long time – about 5 seconds. I would expect 500,000 additions to take almost no time at all.

Obviously, I am doing something wrong – why is it running slow?

Thank you for your help.

PS. The multi-second delay is on the line: cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);


#define N 500000

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x;
    }
}

int main(void) {
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;

    a = (int*)malloc(N * sizeof(int));
    b = (int*)malloc(N * sizeof(int));
    c = (int*)malloc(N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = i + 1;
        b[i] = (i + 1) * 2;
    }

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1024, 1>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    free(a);
    free(b);
    free(c);
}

You are using blocks with a single thread each. Almost all of your CUDA cores are idle.

Thanks for your reply.

I thought <<<1024, 1>>> asks CUDA to invoke my device function on at most 1024 concurrent blocks. So, this is where I expected the parallelism would come from. Did I understand you correctly that the degree of parallelism is solely controlled by the second parameter (1 in this case)?

The other thing that puzzles me is that I only try to execute 500,000 additions, yet it takes seconds to execute. Even on a single core this should complete in a few milliseconds, yet I have to wait about 5 seconds for cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost). Also, during those 5 seconds the CPU utilization is high, which makes no sense to me, since I would expect the CPU to be idle if all the processing happens on the GPU. Could it be that I am executing on a simulator instead of the device? How do I check that?

Thank you.

You are also doing close to 500,000,000 additions, not 500,000: the loop increment is wrong, so each block redoes (almost) the whole work.

tid += blockDim.x;

should be

tid += gridDim.x;
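With that fix, each block advances by the grid size, so the whole array is covered exactly once. The corrected kernel would look like this (a sketch, keeping your original one-thread-per-block launch):

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;       // one thread per block, so blockIdx.x is the global index
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += gridDim.x;       // stride by the number of blocks in the grid
    }
}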

No, it is controlled by both parameters: there is parallelism at both the thread level and the block level. Check chapter 2 of the CUDA Programming Guide.
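For example, a launch like add<<<(N + 255) / 256, 256>>>(...) creates enough 256-thread blocks to cover all N elements, and each thread derives its global index from both levels (a minimal sketch):

__global__ void add(int *a, int *b, int *c) {
    // block-level and thread-level coordinates combined into one global index
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}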

The CPU is busy-waiting for the results of the kernel call before the cudaMemcpy() can be performed. You can change that behavior with cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync).
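A minimal sketch of where that call belongs (it must run before the first runtime call that creates the CUDA context):

int main(void) {
    // Block the calling CPU thread on synchronization instead of spin-waiting.
    // Must be called before any CUDA call that creates the context.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // ... cudaMalloc, cudaMemcpy, kernel launch, cudaMemcpy as before ...
}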

Thanks for helping me. I made changes to my program, but my serial .NET code still outperforms the GPU code. Below is the GPU code, which executes in about 150 ms inside the IDE and about 30 ms outside of it. I have also included a sample .NET program that applies similar logic (copies two arrays, runs the loop, copies an array back), which executes in about 33 ms inside the IDE and 12 ms outside of it.

Why is the serial .NET code so much faster? Thanks for your help!

GPU code (150ms in IDE and about 30ms outside of IDE):

#define N 500000

#include <stdlib.h>
#include <cuda_runtime.h>
#include <iostream>
#include <ctime>

__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

static void HandleError(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        std::cout << "Error: " << cudaGetErrorString(err)
                  << "\nFile: " << file << "\nLine: " << line;
        std::cin.ignore();
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;

    a = (int*)malloc(N * sizeof(int));
    b = (int*)malloc(N * sizeof(int));
    c = (int*)malloc(N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = i + 1;
        b[i] = (i + 1) * 2;
    }

    unsigned int start = clock();

    HANDLE_ERROR(cudaMalloc((void**)&dev_a, N * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, N * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, N * sizeof(int)));

    HANDLE_ERROR(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));

    add<<<128, 128>>>(dev_a, dev_b, dev_c);

    HANDLE_ERROR(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));

    std::cout << "Time taken in millisecs: " << clock() - start;

    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaFree(dev_b));
    HANDLE_ERROR(cudaFree(dev_c));

    free(a);
    free(b);
    free(c);

    std::cin.ignore();
}

.NET serial code (33 ms in IDE and 12 ms outside of IDE):

static void Main(string[] args)
{
    int N = 500000;

    double[] a = new double[N];
    double[] b = new double[N];

    for (int i = 0; i < N; i++)
    {
        a[i] = i + 1;
        b[i] = (i + 1) * 2;
    }

    Stopwatch w = new Stopwatch();
    w.Start();

    var a1 = a.ToArray();
    var b1 = b.ToArray();
    var c1 = new double[N];

    for (int i = 0; i < N; i++)
    {
        c1[i] = a1[i] + b1[i];
    }

    double[] c = c1.ToArray();

    w.Stop();
    Console.WriteLine("Finished in " + w.ElapsedMilliseconds + "ms.");
    Console.Read();
}

.NET doesn’t need to shovel all the data across PCIe. You need a task that performs more work on the GPU per byte transferred, so that the PCIe transfer cost can be amortized.
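You can confirm this by timing the kernel on its own with CUDA events (a sketch, reusing the dev_a/dev_b/dev_c buffers from the code above):

cudaEvent_t startEv, stopEv;
cudaEventCreate(&startEv);
cudaEventCreate(&stopEv);

cudaEventRecord(startEv, 0);
add<<<128, 128>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stopEv, 0);
cudaEventSynchronize(stopEv);

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, startEv, stopEv);
// kernelMs covers only the kernel; the difference from the total
// is dominated by the three cudaMemcpy transfers over PCIe.

cudaEventDestroy(startEv);
cudaEventDestroy(stopEv);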

Hi,
It is not a fair comparison if you include the cudaMalloc calls in the timing; exclude them. Another thing to keep in mind is that the first cudaMalloc call will also set up the CUDA context.
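A sketch of the timed region with both points applied (cudaFree(0) is just one common idiom to force context creation up front; any runtime call would do):

cudaFree(0);  // force CUDA context creation before any timing

HANDLE_ERROR(cudaMalloc((void**)&dev_a, N * sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_b, N * sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_c, N * sizeof(int)));

unsigned int start = clock();  // time only the transfers and the kernel

HANDLE_ERROR(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));
add<<<128, 128>>>(dev_a, dev_b, dev_c);
HANDLE_ERROR(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));

std::cout << "Time taken in millisecs: " << clock() - start;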