Matrix multiplication benchmark

Hi everyone.

I’m trying to show my boss how much the GPU speeds up matrix multiplication. Following the programming guide, I coded the matrix multiplication without shared memory for integers, and it worked perfectly. Then I changed everything to use floats, and now there is a problem.

Depending on what I set BLOCK_SIZE to, the results become unpredictable. For example, I can calculate [1024,4096] x [4096,1024] with BLOCK_SIZE = 16, but if I make BLOCK_SIZE = 32, 64, etc., I get garbage values. This is exactly the same code as for the integers, except that I changed int to float everywhere applicable.

Also, with integers, I got [4096,16384] x [16384,4096] to work in 0.89 seconds. With floats it runs at exactly the same speed, but it doesn’t give the expected output. The kernel is launched like this:

[codebox]
#define BLOCK_SIZE 16

// for host matrices A, B, C and device matrices d_A, d_B, d_C
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
// note: assumes B.width and A.height are multiples of BLOCK_SIZE
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);

kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
[/codebox]
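For reference, here is a minimal sketch of the no-shared-memory kernel in the style of the programming guide’s example (the Matrix struct and field names are assumptions, since I haven’t posted my full code):

[codebox]
// Row-major matrix; width, height and elements follow the
// programming guide’s convention (names here are assumptions).
typedef struct {
    int width;
    int height;
    float* elements;  // device pointer for d_A, d_B, d_C
} Matrix;

__global__ void kernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C as the dot product
    // of one row of A and one column of B.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float value = 0.0f;
    for (int k = 0; k < A.width; ++k)
        value += A.elements[row * A.width + k]
               * B.elements[k * B.width + col];
    C.elements[row * C.width + col] = value;
}
[/codebox]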

What kind of things would cause this to be unreliable? I would have thought thread block sizes of 64x64 or even 128x128 threads were completely reasonable. I’d appreciate any insight. Thanks!

Daniel

This seems a bit fishy, as integer multiplication takes about 4x as long as float multiplication on the GPU. Then again, particularly if you don’t use shared memory, the task is probably entirely bandwidth-bound, so no speed difference is to be expected.

A shot in the dark: might you have accidentally converted an index variable from int to float?

And you don’t get any errors returned from your CUDA calls?

Alternatively, you might have missed one variable in the conversion. We really can’t tell you without seeing any code.

I just checked all my indices, and they are still ints. So: I’m running float matrix multiplication on [4096,16384] x [16384,4096], and BLOCK_SIZE = 32, 64, etc. doesn’t work, while BLOCK_SIZE = 16 DOES work. That is so strange to me. The values aren’t overflowing, because I’ve kept the values in the original matrices between 1.0 and 1.001. Each element of C is therefore the sum of 16,384 products of values in that range, so every result should fall between 16,384 and roughly 16,417 (16,384 × 1.001² ≈ 16,416.8). Those are exactly the values I get when BLOCK_SIZE = 16.

It does seem odd that float and int run at pretty much the same speed. I’m using a Tesla T10 GPU on Windows Server 2003.

I’ll eat my previous words. We can tell you without seeing any code: a thread block size of 64x64 is not reasonable - check the programming guide for the maximum block dimensions. On a Tesla T10 (compute capability 1.3), a block can hold at most 512 threads, so 32x32 = 1024 threads already exceeds the limit, while 16x16 = 256 is fine.
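If in doubt, you can query the limits at runtime. A minimal sketch using cudaGetDeviceProperties() (device 0 assumed):

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // The limits that matter for choosing BLOCK_SIZE:
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}
[/codebox]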

To poke back a little: I’m quite sure this does not work in integer either… ;)

EDIT: Silly me. I see what you mean! But why did it work for integers? haha

I hadn’t checked those yet (this was supposed to be a quick and easy program just to showcase the power of the GPU). I was running the CPU calculation alongside the GPU one, and I thought their output was the same; I must have overlooked something. I wonder why the runtime didn’t tell me I was making this illegal call? How does the GPU handle a launch request that is too large? Does it give the whole “unspecified launch failure” error? Hmmm.

The runtime tells you that you are making an illegal call by returning an error code. :)

To be fair, it is less obvious how this works with kernel launches in the CUDA runtime API, since a kernel launch with the <<< >>> syntax is not a normal function call with a return code. In that case, the error code is returned by the next CUDA call, which is why checking return codes for every call is important. An immediate launch problem can be caught by calling cudaGetLastError() right after the kernel launch, which would have spotted the problem you had. If the kernel fails after running for a little while (due to a memory access violation, say), you’ll probably catch the error when you cudaMemcpy() the result back to the host later.
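In code, the pattern looks something like this (a sketch reusing the launch from above; size is assumed to be the byte count of C):

[codebox]
kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

// Catches immediate launch problems, e.g. an invalid block size.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// An error during kernel execution typically surfaces at the next
// CUDA call instead, e.g. when copying the result back.
err = cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    printf("memcpy (or earlier kernel) failed: %s\n", cudaGetErrorString(err));
[/codebox]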

You can more easily separate the kernel error from a potential memcpy error by calling cudaThreadSynchronize(), which will wait until your kernel is done before returning the current error code. That’s not so good for concurrent CPU/GPU computing, but it can be handy for debugging.
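A sketch of that variant (cudaThreadSynchronize() was the runtime API call of that era; it has since been renamed cudaDeviceSynchronize()):

[codebox]
kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

// Block until the kernel has finished, then report its status.
// This separates a kernel error from any later memcpy error.
cudaError_t err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
[/codebox]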