Non-Square Matrix Multiplication on CUDA

Hi everyone, it's the first time I post here, but I'm having problems with matrix multiplication on non-square matrices.
In the CUDA examples, if I use the SDK code, it is only valid for square matrices.
For example, using a BLOCK_SIZE of 16 and two matrices of 3200x3200 elements, the results are correct.

However, when using the same BLOCK_SIZE with matrixA = 3200x1600 and matrixB = 1600x3200, I get incorrect results.
Does anyone know why the CUDA example doesn't work, and, if possible, could someone give an example of a correct matrix multiplication? I can't get my head around it!
Thanks in advance,
David Lisin

Which SDK example are you referring to? If you are talking about NVIDIA_CUDA_SDK/projects/matrixMul: if you take a look at matrixMul.h, WA, HA, WB, etc. are defined so that they are not square matrices, and when I run the example on my machine, the tests pass.

The test does pass, but it doesn't return the correct result. I have prepared a MEX file with CUDA for MATLAB, and it returns a matrix with the results inverted.

The actual problem with the multiplication is that the kernel is slightly wrong for my case: with MATLAB's column-major data, the SDK's row-major kernel returns the transpose, so the tile indices need to be swapped. The inner loop should actually be:

// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE_VALUE; ++k)
    Csub += As[k][tx] * Bs[ty][k];

The original problem was that the matrices had to be multiples of 16. I have already gotten past that problem, but what I have found now is that the matrix multiplication works with single precision, but unfortunately doesn't (at the moment) with double precision.
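For reference, the multiples-of-16 restriction can be lifted by padding the shared-memory tiles with zeros when a thread falls outside the matrix. A minimal sketch (row-major storage assumed; the kernel name and signature are illustrative, not the SDK's code):

#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// C = A * B with A (hA x wA), B (wA x wB), C (hA x wB), all row-major.
// Out-of-range loads are padded with zeros, so hA, wA and wB need not
// be multiples of BLOCK_SIZE.
__global__ void matMulBounded(const float* A, const float* B, float* C,
                              int hA, int wA, int wB)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float Csub = 0.0f;

    // Walk the tiles along the shared dimension wA.
    int numTiles = (wA + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * BLOCK_SIZE + threadIdx.x;
        int bRow = t * BLOCK_SIZE + threadIdx.y;

        // Load one tile of A and B, padding out-of-range elements with 0.
        As[threadIdx.y][threadIdx.x] =
            (row < hA && aCol < wA) ? A[row * wA + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < wA && col < wB) ? B[bRow * wB + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < hA && col < wB)
        C[row * wB + col] = Csub;
}

This would be launched with a grid of ((wB+15)/16, (hA+15)/16) blocks of 16x16 threads, so the grid covers the whole output even when the dimensions are not multiples of the block size.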

Thanks for the response anyway, and I hope this helps anyone who is currently developing with CUDA.

David Lisin

The CUBLAS library provides double precision matrix arithmetic.
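For reference, the legacy (non-thunking) C interface in cublas.h exposes it roughly like this; the comment is mine, the prototype is a sketch from memory:

// Computes C = alpha * op(A) * op(B) + beta * C, column-major storage.
void cublasDgemm(char transa, char transb, int m, int n, int k,
                 double alpha, const double *A, int lda,
                 const double *B, int ldb,
                 double beta, double *C, int ldc);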

N.

Again, thanks for the reply, but here is my current situation:

I am doing several calculations one after the other in CUDA to maximize performance. I'll set up a simple example, just to state the matter:

(MATLAB pseudocode):

for i = 1:100
    result = A[1:end-i] .* B[1:end-i];
    result = result * C[1:end-i];
    result = result * D[i];
    result = 1./result * E[1:end-1];
end

The thing is, I can copy A, B, C, D and E to the device only once! And I can do all the calculations in different kernels, but I repeat, copying to the device only once, using the result of one kernel as the input pointer to the next, which speeds things up a lot. I am using my own kernels for the rest of the functions.

If I use CUBLAS, I need to copy the result of kernel1 back to the CPU, reserve memory with cublasAlloc, execute DGEMM, copy back, reserve memory for the result for CUDA, copy back to the device, and then carry on executing. But the data transfers ruin my speedups… That's why I was wondering if double-precision matrix multiplication works for anyone, since the one offered by the CUDA SDK, modified for double precision and compiled with -arch sm_13, doesn't work.

Thanks again,

David Lisin

No, you don’t.
Cublas can operate on the data resident on the GPU.

Something similar to this (the code is for single precision; you can use double):

! Allocate matrices on GPU
cublasAlloc(m1*m1, size_of_real, devPtrA)
cublasAlloc(m1*m1, size_of_real, devPtrB)
cublasAlloc(m1*m1, size_of_real, devPtrC)
! Copy data from CPU to GPU
cublasSetMatrix(m1, m1, size_of_real, A, m1, devPtrA, m1)
cublasSetMatrix(m1, m1, size_of_real, B, m1, devPtrB, m1)
cublasSetMatrix(m1, m1, size_of_real, C, m1, devPtrC, m1)
! Call SGEMM in CUBLAS library using NON-THUNKING interface (library is expecting data in GPU memory)
call cublasSGEMM('n', 'n', m1, m1, m1, alpha, devPtrA, m1, devPtrB, m1, beta, devPtrC, m1)
! Add all the other CUBLAS calls you need.
! Copy data from GPU to CPU
cublasGetMatrix(m1, m1, size_of_real, devPtrC, m1, C, m1)
! Free memory on device
cublasFree(devPtrA)
cublasFree(devPtrB)
cublasFree(devPtrC)

You don’t. You can leave the results on the device between CUBLAS calls, you can intermingle your own kernels with CUBLAS functions operating on the same data or intermediate results in device memory, and you can use cudaMalloc(), cudaMemcpy() and cudaFree() to manage memory on the device with CUBLAS. CUBLAS pointers are just regular CUDA pointers and can be used interchangeably (remembering that CUBLAS uses Fortran order storage).
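For example, a minimal C-for-CUDA sketch of that pattern: one custom kernel and one DGEMM chained on the same device buffers, with a single copy in and a single copy out. The elementwiseMul kernel and the chainOnDevice wrapper are illustrative names, not library code; it assumes the legacy cublas.h interface and a device compiled for double precision with -arch sm_13:

#include <cuda_runtime.h>
#include <cublas.h>  // legacy (non-thunking) CUBLAS interface

// Illustrative elementwise kernel standing in for "kernel1".
__global__ void elementwiseMul(const double* a, const double* b,
                               double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];
}

// Chain a custom kernel and DGEMM on device-resident data.
// n is the matrix dimension (square here for brevity); hA, hB, hC
// are host arrays of n*n doubles, stored column-major for CUBLAS.
void chainOnDevice(const double* hA, const double* hB, double* hC, int n)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dTmp, *dC;

    cublasInit();
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dTmp, bytes);
    cudaMalloc((void**)&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 1: custom kernel; the result stays on the device.
    int threads = 256;
    int blocks = (n * n + threads - 1) / threads;
    elementwiseMul<<<blocks, threads>>>(dA, dB, dTmp, n * n);

    // Step 2: hand the same device pointer to DGEMM (dC = dTmp * dB).
    cublasDgemm('n', 'n', n, n, n, 1.0, dTmp, n, dB, n, 0.0, dC, n);

    // Only the final result crosses back to the host.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dTmp); cudaFree(dC);
    cublasShutdown();
}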

I had tried this a couple of weeks ago, but didn't get correct results (obviously my fault, because I have tried it today and it works perfectly).

"Cublas can operate on the data resident on the GPU."

I should have looked more into it, because my earlier attempts returned more or less coherent results, but not 100% correct. I was probably calling SGEMM instead of DGEMM.

Again, thanks to everyone for answering!

David Lisin