Timing problem in matrix multiplication

Hi,

I am new to CUDA and have been trying out matrix multiplication, but I am having a problem when I time the GPU runs. I am posting the timing code below.

[codebox]cudaEvent_t start, stop;
float gpu_time;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);

// Launch code on device
mat_mul_on_device<<<dimGrid, dimBlock>>>(a_d, b_d, p_d, WIDTH);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&gpu_time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]

After this I print the time. I get these times for various widths of (square) matrices being multiplied:

width | cpu time (s) | gpu time (ms)
50 | 0.0000000e+00 | 6.0927998e-02
100 | 0.0000000e+00 | 4.7520000e-02
150 | 0.0000000e+00 | 6.4640000e-02
200 | 0.0000000e+00 | 6.5664001e-02
250 | 0.0000000e+00 | 6.5087996e-02
300 | 0.0000000e+00 | 6.7359999e-02
350 | 0.0000000e+00 | 6.6271998e-02
400 | 0.0000000e+00 | 6.6720001e-02
450 | 0.0000000e+00 | 6.7103997e-02
500 | 1.0000000e+00 | 6.6656001e-02
550 | 1.0000000e+00 | 6.8223998e-02
600 | 2.0000000e+00 | 7.0496000e-02
650 | 3.0000000e+00 | 7.5680003e-02
700 | 3.0000000e+00 | 7.6831996e-02
750 | 4.0000000e+00 | 7.8720003e-02
800 | 5.0000000e+00 | 8.0480002e-02
850 | 6.0000000e+00 | 8.0544002e-02
900 | 8.0000000e+00 | 8.1248000e-02
950 | 9.0000000e+00 | 8.2528003e-02
1000 | 1.1000000e+01 | 8.1184000e-02
2000 | 1.1900000e+02 | 7.9839997e-02

My problem is that the GPU times barely grow with the matrix width and just fluctuate up and down, which I don't understand. Also, is there a header file that has to be included when using the cudaEvent… calls?

Please help. Thanks!

Those timing calls are correct. I would guess that your kernel isn't actually running at all - a 2000x2000 matrix multiply is about 2·2000³ ≈ 1.6×10^10 floating-point operations, and doing that in 0.08 ms would be over 200,000 GFLOP/s, which is clearly nonsense. Add some error checking after the launch and check what is happening.
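Something like this right after the launch (just a sketch, reusing your dimGrid/dimBlock and device pointers) will report both a bad launch configuration and errors raised while the kernel executes:

[codebox]mat_mul_on_device<<<dimGrid, dimBlock>>>(a_d, b_d, p_d, WIDTH);

// Catches launch/configuration errors (e.g. too many threads per block)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));

// Waits for the kernel and catches errors raised during execution
err = cudaThreadSynchronize();   // or cudaDeviceSynchronize() on newer toolkits
if (err != cudaSuccess)
    printf("kernel execution failed: %s\n", cudaGetErrorString(err));[/codebox]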

Thanks, the kernel wasn't launching, as you said! For matrix width > 22 I get all zeros in the product matrix when I copy it back to the host (I initialise it to all zeros when copying to the device). My kernel follows below; I'm guessing it's a memory allocation issue, but I am not able to say what exactly the problem is.

[codebox]

__global__ void mat_mul_on_device(float *a_d, float *b_d, float *p_d, int w){

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	int k;

	// p_ij stores the ij-th element of p computed by this thread
	float p_ij = 0.0f;

	// Compute p_ij as the dot product of row i of a and column j of b
	for(k = 0; k < w; ++k){
		p_ij += a_d[i*w + k] * b_d[k*w + j];
	}

	__syncthreads();

	// Store the result in global memory for this thread's element
	p_d[i*w + j] = p_ij;
}

[/codebox]

Hi,

I figured out the problem: I was launching the kernel with a single block, and for matrix width = 23 the number of threads per block (23 × 23 = 529) exceeds the 512-thread limit.
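A launch configuration along these lines works for the larger widths (just a sketch; the 16x16 block size and the TILE name are my own choice, and the kernel then needs an if (i < w && j < w) guard for widths that are not a multiple of the block size):

[codebox]const int TILE = 16;                     // 16 x 16 = 256 threads per block, under the 512 limit

dim3 dimBlock(TILE, TILE);
dim3 dimGrid((WIDTH + TILE - 1) / TILE,  // round up so the grid covers the whole matrix
             (WIDTH + TILE - 1) / TILE);

mat_mul_on_device<<<dimGrid, dimBlock>>>(a_d, b_d, p_d, WIDTH);[/codebox]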

Thanks for the help!