Timing CUDA Code: finding the best way to time CUDA kernel execution

Hi,

I’m kinda new to CUDA and was looking for a simple way to time the entire kernel execution. Here’s what I’m doing:

[codebox]clock_t start_d = clock();

    vecAdd<<<1,25>>>(a_d, b_d, c_d);

    cudaThreadSynchronize();   // wait for the kernel to finish before reading the clock

clock_t end_d = clock();

double time_d = (double)(end_d - start_d) / CLOCKS_PER_SEC;[/codebox]

where a_d, b_d, c_d are device-resident arrays with 25 elements, and time_d represents the total time to run the code on the GPU.

The kernel only multiplies each element of a_d and b_d and stores it in c_d.

[codebox]__global__ void vecAdd(int *A, int *B, int *C)

{

    int i = threadIdx.x;        // one thread per element

    if (i < 25)

            C[i] = A[i] * B[i];

}[/codebox]

Now I’m trying to compare this with a CPU version of the same code:

[codebox]clock_t start_h = clock();

    vecAdd_h(a, b, c);

clock_t end_h = clock();[/codebox]

and

[codebox]void vecAdd_h(int *A1, int *B1, int *C1)

{

    for (int i = 0; i < 25; i++)

            C1[i] = A1[i] * B1[i];

}[/codebox]

where a, b, c are host arrays. But the results I’m getting don’t look promising:

[codebox]Time on device: 0.000093

Time on CPU   : 0.000001[/codebox]

Am I doing something wrong here?

For clarity, here are the machine specs:

MacBook Pro - 8600M GT

Platform: Mac OS X 10.5.6

CUDA Version: 2.0

  • Sahil

Using the event API for timing will give you much more accurate results for the time spent in a single, extremely short kernel launch.
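For reference, here is a minimal sketch of event-based timing around the same kernel launch (using the standard cudaEvent* calls; the variable names are just placeholders):

[codebox]cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              // record on the default stream
vecAdd<<<1,25>>>(a_d, b_d, c_d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // block until the stop event has completed

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]

Unlike clock(), the events are recorded on the GPU's own timeline, so the measurement isn't distorted by host-side timer resolution.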

The only other thing you are doing “wrong” is not looking at a wide enough parameter space. Try benchmarking your code with 10,000 or 1,000,000 elements instead of 25 and you should see a speedup. There are lots and lots of threads on the forums discussing this, but here is the short version:

  1. Each kernel call has an overhead of ~10-20 us, which can be longer than the running time of an extremely small kernel.
  2. The GPU is built to efficiently process tens of thousands of threads concurrently. Unless you run at least that many threads, you aren’t even stretching the GPU’s legs, much less making it run at full speed.

Thanks, MrAnderson, for the insightful points. I did notice that 25 is a rather small number of elements. But when I tried increasing it, something strange happened: the GPU results stopped matching the CPU results for the same elements:

[codebox]
C[63] = 65 65
C[64] = 288 252
C[65] = 120 180
C[66] = 44 34
C[67] = 63 462
C[68] = 108 180
C[69] = 15 6

GPU Time : 0.000122
CPU Time : 0.000002
[/codebox]

And it gets worse after 100 elements.

Remember that only 512 threads per block are allowed, so something like <<<1,25000>>> will not work. If you increase the number of blocks instead, this will also give you a speedup. And make sure your allocs and memcpys work; check everything for errors.
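One rough way to do that last check is with the standard error-reporting calls; here is a sketch (N, numBlocks, and threadsPerBlock are placeholder names for whatever your code actually uses):

[codebox]cudaError_t err;

err = cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    printf("memcpy of a failed: %s\n", cudaGetErrorString(err));

vecAdd<<<numBlocks, threadsPerBlock>>>(a_d, b_d, c_d);
cudaThreadSynchronize();            // make sure the kernel has actually finished
err = cudaGetLastError();           // picks up launch and execution errors
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));[/codebox]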

Thanks, Ocire. I was aware that the threads-per-block limit might cause an issue, but I’m not able to understand why I’m getting wrong results beyond just 64 threads.

Did you change your kernel to handle larger arrays?

You probably want to change:

int i = threadIdx.x;

into this:

int i = blockIdx.x * blockDim.x + threadIdx.x;

And then launch an appropriate number of blocks to get the desired total number of threads.
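Putting those two pieces together, a minimal sketch of the revised kernel and launch might look like this (here N is passed in as an extra kernel argument, and threadsPerBlock / numBlocks are placeholder names, not from the posts above):

[codebox]__global__ void vecAdd(int *A, int *B, int *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index across all blocks
    if (i < N)                                       // guard the extra threads in the last block
        C[i] = A[i] * B[i];
}

// launch enough blocks to cover all N elements
int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
vecAdd<<<numBlocks, threadsPerBlock>>>(a_d, b_d, c_d, N);[/codebox]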