Hi,
I’m fairly new to CUDA and was looking for a simple way to time an entire kernel execution. Here’s what I’m doing:
[codebox]clock_t start_d = clock();
vecAdd<<<1,25>>>(a_d, b_d, c_d);
cudaThreadSynchronize(); // kernel launches are asynchronous; wait for completion
clock_t end_d = clock();
double time_d = (double)(end_d - start_d) / CLOCKS_PER_SEC;[/codebox]
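In case clock() is too coarse for this, I also saw event-based timing in the programming guide. This is my untested sketch of it (the cudaEvent* calls are from the runtime API; the 0 argument is the default stream):
[codebox]cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);             // mark start on the default stream
vecAdd<<<1,25>>>(a_d, b_d, c_d);
cudaEventRecord(stop, 0);              // mark stop after the kernel
cudaEventSynchronize(stop);            // wait until the stop event completes

cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]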
where a_d, b_d, and c_d are device-resident arrays of 25 elements each, and time_d is the total time taken to run the kernel.
The kernel just multiplies each element of a_d by the corresponding element of b_d and stores the result in c_d:
[codebox]__global__ void vecAdd(int *A, int *B, int *C)
{
    int i = threadIdx.x; // one thread per element
    if (i < 25)
        C[i] = A[i] * B[i];
}[/codebox]
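(In case it matters: for bigger arrays I’d generalize the kernel along these lines, with the element count N passed in as a parameter. This is just a sketch I haven’t benchmarked yet.)
[codebox]__global__ void vecAdd(int *A, int *B, int *C, int N)
{
    // global thread index across all blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] * B[i];
}

// launch with enough 256-thread blocks to cover N elements:
// vecAdd<<<(N + 255) / 256, 256>>>(a_d, b_d, c_d, N);[/codebox]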
Now, I’m trying to compare this with a CPU version of the same code:
[codebox]clock_t start_h = clock();
vecAdd_h(a, b, c);
clock_t end_h = clock();
double time_h = (double)(end_h - start_h) / CLOCKS_PER_SEC;[/codebox]
and
[codebox]void vecAdd_h(int *A1, int *B1, int *C1)
{
    for (int i = 0; i < 25; i++)
        C1[i] = A1[i] * B1[i];
}[/codebox]
where a, b, and c are host arrays. But the timings I’m getting (in seconds) don’t look promising:
[codebox]Time on device: 0.000093
Time on CPU : 0.000001[/codebox]
Am I doing something wrong here?
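My guess is that the fixed kernel-launch overhead (tens of microseconds, from what I’ve read) simply swamps 25 multiplications, so I was planning to rerun with a much larger array, roughly like this (N is arbitrary, error checking omitted; it relies on the generalized vecAdd sketched above):
[codebox]#include <stdlib.h>
#include <cuda_runtime.h>

// assumes the generalized vecAdd(int*, int*, int*, int N) from above
int main(void)
{
    const int N = 1 << 20;        // ~1M elements instead of 25
    size_t bytes = N * sizeof(int);

    int *a = (int*)malloc(bytes);
    int *b = (int*)malloc(bytes);
    int *c = (int*)malloc(bytes);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    int *a_d, *b_d, *c_d;
    cudaMalloc((void**)&a_d, bytes);
    cudaMalloc((void**)&b_d, bytes);
    cudaMalloc((void**)&c_d, bytes);
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(N + 255) / 256, 256>>>(a_d, b_d, c_d, N);
    cudaThreadSynchronize();      // wait for the kernel before copying back

    cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a); free(b); free(c);
    return 0;
}[/codebox]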
For reference, here are the machine specs:
MacBook Pro - GeForce 8600M GT
Platform: Mac OS X 10.5.6
CUDA Version: 2.0
- Sahil