matrix multiplication benchmark

Hi everyone.

I’m trying to show my boss how much the GPU speeds up matrix multiplication. Following the programming guide, I coded the matrix multiplication without shared memory for integers, and it worked perfectly. Then I changed everything to use floats, and now there is a problem.

Depending on what I set BLOCK_SIZE to, the results become unpredictable. For example, I can calculate [1024,4096] x [4096,1024] with BLOCK_SIZE = 16, but if I make BLOCK_SIZE = 32, 64, etc., I get garbage values. This is exactly the same code as for the integers, except I changed int to float everywhere applicable.

Also, with integers, I got [4096,16384] x [16384,4096] to work in 0.89 seconds. For floats, it runs at exactly the same speed, but it doesn’t give the expected output. The kernel is launched like this:


#define BLOCK_SIZE 16

// for host matrices A, B, C and device matrices d_A, d_B, d_C
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);


What kind of things would cause this to be unreliable? I would think thread block sizes of perhaps 64x64 and 128x128 threads are completely reasonable. I’d appreciate any insight. Thanks!
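For reference, the no-shared-memory kernel from the programming guide that I’m describing looks roughly like this (the Matrix struct with width/height/elements fields is the guide’s convention; my actual code may differ in names):

```cuda
// Naive matrix multiplication without shared memory, in the style of the
// CUDA C Programming Guide. Assumes A.width == B.height and that all
// dimensions are exact multiples of BLOCK_SIZE.
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float Cvalue = 0.0f;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];

    C.elements[row * C.width + col] = Cvalue;
}
```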


This seems a bit fishy, as integer multiplication takes 4x as long as float multiplication. Then again, particularly if you don’t use shared memory, the task is probably entirely bandwidth-bound, so no speed difference is expected.

A shot in the dark: might you have accidentally converted an index variable from int to float?

And you don’t get any returned errors from your CUDA calls?

Alternatively, you might have missed one variable in the conversion. We really can’t tell you without seeing any code.

I just checked all my indices, and they are still the same. So, I’m running matrix multiplication of floats on matrices [4096,16384] x [16384,4096], and BLOCK_SIZE = 32, 64, etc. doesn’t work. BLOCK_SIZE = 16 DOES work. That is so strange to me. The values aren’t overflowing, because I’ve made the values in the original matrices between 1.0 and 1.001. So each element in the C matrix is the sum of 16,384 products A.elem * B.elem, which comes to approx. 16,000-something per value in C. I do get these values when BLOCK_SIZE = 16.

It does seem odd that float and int run at pretty much the same speed. I’m using a Tesla T10 GPU on Windows Server 2003.

I’ll eat my previous words. We can tell you without seeing any code: a thread block size of 64x64 is not reasonable. Check the programming guide for the maximum block dimensions - on your Tesla T10 (compute capability 1.3), a block is limited to 512 threads total, so even 32x32 = 1024 threads per block exceeds the limit, never mind 64x64.
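Rather than guessing, you can query the limits at runtime. A minimal host-side sketch (device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // On a Tesla T10 (compute capability 1.3), maxThreadsPerBlock is 512,
    // so a 32x32 = 1024-thread block already fails to launch.
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}
```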

To poke back a little: I’m quite sure this does not work in integer either… ;)

EDIT: Silly me. I see what you mean! But why did it work for integers? :/ haha

I hadn’t checked those yet (this was supposed to be a quick-and-easy program to just showcase the power of the GPU). I was running the CPU calculation simultaneously with the GPU, and I thought their outputs were the same. I must have overlooked something. I wonder why the runtime didn’t tell me I was making this illegal call? How does the GPU handle getting a request that is too large? Does it do the whole “unspecified kernel failure” error? Hmmm.

The runtime tells you that you are making an illegal call by returning an error code. :)

To be fair, it is less obvious how this works with kernel launches in the CUDA runtime API, since a kernel launch with the <<< >>> syntax is not a normal function call with a return code. In that case, the error code will be returned by the next CUDA call, which is why checking return codes for every call is important. An immediate launch problem can be caught by calling cudaGetLastError() after the kernel launch, which would spot the problem you had. If the kernel fails after running for a little while (due to a memory access violation), you’ll probably catch the error when you cudaMemcpy() the result back to the host later.
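In code, the pattern looks something like this (d_A, d_B, d_C and the launch configuration are from your snippet; C.elements and size are assumed names for the host result buffer and its byte count):

```cuda
// Error checking around a kernel launch (sketch).
kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

// Catches immediate launch failures, e.g. a block size over the device limit:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

// Errors that occur while the kernel is running surface on the next CUDA call,
// typically the copy of the result back to the host:
err = cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    fprintf(stderr, "memcpy (or earlier kernel) failed: %s\n",
            cudaGetErrorString(err));
```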

You can more easily separate the kernel error from a potential memcpy error by calling cudaThreadSynchronize(), which will wait until your kernel is done before returning the current error code. That’s not so good for concurrent CPU/GPU computing, but it can be handy for debugging.
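A debugging sketch of that idea (cudaThreadSynchronize() was the era-appropriate call; newer toolkits deprecate it in favor of cudaDeviceSynchronize()):

```cuda
// Separate launch errors, execution errors, and memcpy errors (sketch).
kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

cudaError_t err = cudaGetLastError();   // launch-time errors (bad config, etc.)
if (err == cudaSuccess)
    err = cudaThreadSynchronize();      // wait; returns execution-time errors
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
// Any error from a subsequent cudaMemcpy now really is a memcpy error.
```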