I’m trying to show my boss how much the GPU speeds up matrix multiplication. Following the programming guide, I wrote the matrix multiplication kernel without shared memory for integers, and it worked perfectly. I then changed everything to use floats, and now there is a problem.
Depending on what I set BLOCK_SIZE to, the results become unpredictable. For example, I can compute [1024,4096] x [4096,1024] with BLOCK_SIZE = 16, but with BLOCK_SIZE = 32, 64, etc., I get garbage values. The code is identical to the integer version except that I changed int to float everywhere applicable.
Also, with integers, I got [4096,16384] x [16384,4096] to work in 0.89 seconds. With floats it runs in the same time, but it doesn’t produce the expected output. The kernel is launched like this:
#define BLOCK_SIZE 16
// for host matrices A, B, C and device matrices d_A, d_B, d_C;
// matrix dimensions are multiples of BLOCK_SIZE, so the integer division below is exact
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
What kind of things would cause this to be unreliable? I would have thought thread block sizes of 64x64 or even 128x128 threads were completely reasonable, but maybe I’m running into some limit I don’t know about. I’d appreciate any insight. Thanks!