Timing CUDA Code: finding the best way to time CUDA kernel execution

Hi,

I’m kinda new to CUDA and was looking for a simple way to time the entire kernel execution. Here’s what I’m doing:

[codebox]clock_t start_d = clock();

    vecAdd<<<1,25>>>(a_d, b_d, c_d);

    cudaThreadSynchronize();   // wait for the kernel to finish before reading the clock

clock_t end_d = clock();

double time_d = (double)(end_d - start_d) / CLOCKS_PER_SEC;[/codebox]

where a_d, b_d, c_d are device-resident arrays with 25 elements, and time_d represents the total time to run the code on the GPU.

The kernel only multiplies each element of a_d and b_d and stores it in c_d.

[codebox]__global__ void vecAdd(int *A, int *B, int *C)

{

    int i = threadIdx.x;        // one thread per element

    if (i < 25)

            C[i] = A[i] * B[i];

}[/codebox]

Now I’m trying to compare this with a CPU version of the same code:

[codebox]clock_t start_h = clock();

    vecAdd_h(a, b, c);

clock_t end_h = clock();[/codebox]

and

[codebox]void vecAdd_h(int *A1, int *B1, int *C1)

{

    for (int i = 0; i < 25; i++)

            C1[i] = A1[i] * B1[i];

}[/codebox]

where a, b, c are host arrays. But the results I’m getting don’t look promising:

[codebox]Time on device: 0.000093

Time on CPU   : 0.000001[/codebox]

Am I doing something wrong here?

For clarity, here are the machine specs:

MacBook Pro - 8600M GT

Platform: Mac OS X 10.5.6

CUDA Version: 2.0

  • Sahil

Using the event API for timing will give you much more accurate results for the time spent in a single, extremely short kernel launch.
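For reference, here is a minimal sketch of event-based timing around the same kernel launch (using the standard cudaEvent* calls; the variable names are just placeholders):

[codebox]cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              // record on the default stream
vecAdd<<<1,25>>>(a_d, b_d, c_d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // block until the stop event has completed

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]

Unlike clock(), the events are recorded on the GPU's own timeline, so the measurement isn't distorted by host-side timer resolution.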

The only other thing you are doing “wrong” is not looking at a wide enough parameter space. Try benchmarking your code with 10,000 or 1,000,000 elements instead of 25 and you should see a speedup. There are lots and lots of threads on the forums discussing this, but here is the short version:

  1. Each kernel call has an overhead of ~10-20 us, which can be longer than the running time of an extremely small kernel.
  2. The GPU is built to efficiently process tens of thousands of threads concurrently. Unless you run at least that many threads, you aren’t even stretching the GPU’s legs, much less making it run at full speed.

Thanks, MrAnderson, for the insightful points. I did notice that 25 is a rather small number of elements. But when I tried increasing it, something strange happened: the GPU results stopped matching the CPU results for the same elements:

[codebox]
C[63] = 65 65
C[64] = 288 252
C[65] = 120 180
C[66] = 44 34
C[67] = 63 462
C[68] = 108 180
C[69] = 15 6

GPU Time : 0.000122
CPU Time : 0.000002
[/codebox]

And it gets worse after 100 elements.

Remember that only 512 threads per block are allowed, so something like <<<1,25000>>> will not work. If you increase the number of blocks instead, this will also give you a speedup. And make sure your allocs and memcpys work; check everything for errors.
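One rough way to do that last check is with the standard error-reporting calls; here is a sketch (N, numBlocks, and threadsPerBlock are placeholder names for whatever your code actually uses):

[codebox]cudaError_t err;

err = cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    printf("memcpy of a failed: %s\n", cudaGetErrorString(err));

vecAdd<<<numBlocks, threadsPerBlock>>>(a_d, b_d, c_d);
cudaThreadSynchronize();            // make sure the kernel has actually finished
err = cudaGetLastError();           // picks up launch and execution errors
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));[/codebox]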

Thanks, Ocire. I was aware that the threads-per-block limit might cause an issue, but I’m not able to understand why I’m getting wrong results beyond just 64 threads.

Did you change your kernel to handle larger arrays?

You probably want to change:

int i = threadIdx.x;

into this:

int i = blockIdx.x * blockDim.x + threadIdx.x;

And then launch an appropriate number of blocks to get the desired total number of threads.
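Putting those two pieces together, a minimal sketch of the revised kernel and launch might look like this (here N is passed in as an extra kernel argument, and threadsPerBlock / numBlocks are placeholder names, not from the posts above):

[codebox]__global__ void vecAdd(int *A, int *B, int *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index across all blocks
    if (i < N)                                       // guard the extra threads in the last block
        C[i] = A[i] * B[i];
}

// launch enough blocks to cover all N elements
int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
vecAdd<<<numBlocks, threadsPerBlock>>>(a_d, b_d, c_d, N);[/codebox]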