Timing problem in matrix multiplication

Hi,

I am new to CUDA and have been trying out matrix multiplication, but I am having a problem when I time the GPU runs. I am posting the timing code below.

[codebox]cudaEvent_t start, stop;
float gpu_time;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);

// Launch code on device
mat_mul_on_device<<<dimGrid, dimBlock>>>(a_d, b_d, p_d, WIDTH);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&gpu_time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]

After this I print the time. I get these times for various widths of (square) matrices being multiplied:

width | cpu time (s) | gpu time (ms)
50 | 0.0000000e+00 | 6.0927998e-02
100 | 0.0000000e+00 | 4.7520000e-02
150 | 0.0000000e+00 | 6.4640000e-02
200 | 0.0000000e+00 | 6.5664001e-02
250 | 0.0000000e+00 | 6.5087996e-02
300 | 0.0000000e+00 | 6.7359999e-02
350 | 0.0000000e+00 | 6.6271998e-02
400 | 0.0000000e+00 | 6.6720001e-02
450 | 0.0000000e+00 | 6.7103997e-02
500 | 1.0000000e+00 | 6.6656001e-02
550 | 1.0000000e+00 | 6.8223998e-02
600 | 2.0000000e+00 | 7.0496000e-02
650 | 3.0000000e+00 | 7.5680003e-02
700 | 3.0000000e+00 | 7.6831996e-02
750 | 4.0000000e+00 | 7.8720003e-02
800 | 5.0000000e+00 | 8.0480002e-02
850 | 6.0000000e+00 | 8.0544002e-02
900 | 8.0000000e+00 | 8.1248000e-02
950 | 9.0000000e+00 | 8.2528003e-02
1000 | 1.1000000e+01 | 8.1184000e-02
2000 | 1.1900000e+02 | 7.9839997e-02

My problem is that the GPU times barely grow with the matrix width and just fluctuate up and down, which I don't understand. Also, is there a header file that has to be included when using the cudaEvent… calls?

Please help. Thanks!

Those timing calls are correct. I would guess that your kernel isn't actually running at all - a 2000x2000 matrix multiply is about 2·2000³ ≈ 1.6×10^10 floating-point operations, and doing that in 0.08 ms would be over 200,000 GFLOP/s, which is clearly nonsense. Add some error checking after the launch and check what is happening.
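Something like this right after the launch (just a sketch, reusing your dimGrid/dimBlock and device pointers) will report both a bad launch configuration and errors raised while the kernel executes:

[codebox]mat_mul_on_device<<<dimGrid, dimBlock>>>(a_d, b_d, p_d, WIDTH);

// Catches launch/configuration errors (e.g. too many threads per block)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));

// Waits for the kernel and catches errors raised during execution
err = cudaThreadSynchronize();   // or cudaDeviceSynchronize() on newer toolkits
if (err != cudaSuccess)
    printf("kernel execution failed: %s\n", cudaGetErrorString(err));[/codebox]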

Thanks, the kernel wasn't launching, as you said! For matrix width > 22 I get all zeros in the product matrix when I copy it back to the host (I initialise it to all zeros when copying to the device). My kernel follows below; I'm guessing it's a memory allocation issue, but I am not able to say what exactly the problem is.

[codebox]

__global__ void mat_mul_on_device(float *a_d, float *b_d, float *p_d, int w){

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	int k;

	// p_ij stores the ij-th element of p computed by this thread
	float p_ij = 0.0f;

	// Compute p_ij as the dot product of row i of a and column j of b
	for(k = 0; k < w; ++k){
		p_ij += a_d[i*w + k] * b_d[k*w + j];
	}

	__syncthreads();

	// Store the result in global memory for this thread's element
	p_d[i*w + j] = p_ij;
}

[/codebox]

Hi,

I figured out the problem: I was launching the kernel with a single block, and for matrix width = 23 the number of threads per block (23 × 23 = 529) exceeds the 512-thread limit.
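A launch configuration along these lines works for the larger widths (just a sketch; the 16x16 block size and the TILE name are my own choice, and the kernel then needs an if (i < w && j < w) guard for widths that are not a multiple of the block size):

[codebox]const int TILE = 16;                     // 16 x 16 = 256 threads per block, under the 512 limit

dim3 dimBlock(TILE, TILE);
dim3 dimGrid((WIDTH + TILE - 1) / TILE,  // round up so the grid covers the whole matrix
             (WIDTH + TILE - 1) / TILE);

mat_mul_on_device<<<dimGrid, dimBlock>>>(a_d, b_d, p_d, WIDTH);[/codebox]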

Thanks for the help!