Hi everyone.

I’m trying to show my boss how much the GPU speeds up matrix multiplication. Following the programming guide, I implemented the matrix multiplication example without shared memory for integers, and it worked perfectly. Then I changed everything to use floats, and now there is a problem.

Depending on what I set BLOCK_SIZE to, the results become unpredictable. For example, I can compute [1024,4096] x [4096,1024] with BLOCK_SIZE = 16, but if I set BLOCK_SIZE to 32, 64, etc., I get garbage values. The code is exactly the same as the integer version, except that int is changed to float everywhere applicable.
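For reference, the kernel is essentially the non-shared-memory example from the CUDA Programming Guide with int changed to float. This is a sketch from memory, so the exact names may differ slightly from my actual code:

[codebox]
// Row-major matrix, as in the programming-guide example
typedef struct { int width; int height; float* elements; } Matrix;

// One thread computes one element of C = A * B (no shared memory)
__global__ void kernel(Matrix A, Matrix B, Matrix C)
{
    float Cvalue = 0.0f;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
[/codebox]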

Also, with integers, I got [4096,16384] x [16384,4096] to finish in 0.89 seconds. With floats it runs in the same time, but it doesn’t give the expected output. The kernel is launched like this:

[codebox]
#define BLOCK_SIZE 16

...

// for host matrices A, B, C and device matrices d_A, d_B, d_C
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
[/codebox]
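I should add that I’m not currently doing any error checking after the launch. I understand I could add something like this right after the kernel call to see whether the launch itself is failing (a sketch using the runtime API, not my actual code):

[codebox]
kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

// An invalid launch configuration (e.g. too many threads per block)
// shows up here rather than in the kernel's output
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// Errors that occur while the kernel runs show up after synchronizing
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
[/codebox]

Is that the right way to tell a bad configuration apart from a bug in the kernel?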

What kinds of things would cause this to be unreliable? I would have thought thread block sizes of 64x64 or even 128x128 threads were completely reasonable. I’d appreciate any insight. Thanks!

Daniel