Non-Square Matrix Multiplication on CUDA

Hi everyone, it's the first time I post here, but I'm having problems with matrix multiplication on non-square matrices.
In the CUDA examples, if I use the SDK code, it is only valid for square matrices.
For example, using a BLOCK_SIZE of 16 and two matrices of 3200x3200 elements, the results are correct.

However, when using the same BLOCK_SIZE with matrixA = 3200x1600 and matrixB = 1600x3200, I get incorrect results.
Does anyone know why the CUDA example doesn't work, and, if possible, could someone give an example of a correct matrix multiplication? I can't get my head around it!
Thanks in advance,
David Lisin

Which SDK example are you referring to? If you are talking about NVIDIA_CUDA_SDK/projects/matrixMul: if you take a look at matrixMul.h, WA, HA, WB, etc. are defined so that they are not square matrices, and when I run the example on my machine, the tests pass.

The test does pass, but it doesn't return the correct result. I have prepared a MEX file with CUDA for MATLAB, and it returns a matrix with the results inverted.

The actual problem with the multiplication is that the kernel is slightly wrong for my case: with MATLAB's column-major data, the SDK's row-major kernel returns the transpose, so the tile indices need to be swapped. The inner loop should actually be:

// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE_VALUE; ++k)
    Csub += As[k][tx] * Bs[ty][k];

The original problem was that the matrices had to be multiples of 16. I have already gotten past that problem, but what I have found now is that the matrix multiplication works with single precision, but unfortunately doesn't (at the moment) with double precision.
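For reference, the multiples-of-16 restriction can be lifted by padding the shared-memory tiles with zeros when a thread falls outside the matrix. A minimal sketch (row-major storage assumed; the kernel name and signature are illustrative, not the SDK's code):

#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// C = A * B with A (hA x wA), B (wA x wB), C (hA x wB), all row-major.
// Out-of-range loads are padded with zeros, so hA, wA and wB need not
// be multiples of BLOCK_SIZE.
__global__ void matMulBounded(const float* A, const float* B, float* C,
                              int hA, int wA, int wB)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float Csub = 0.0f;

    // Walk the tiles along the shared dimension wA.
    int numTiles = (wA + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * BLOCK_SIZE + threadIdx.x;
        int bRow = t * BLOCK_SIZE + threadIdx.y;

        // Load one tile of A and B, padding out-of-range elements with 0.
        As[threadIdx.y][threadIdx.x] =
            (row < hA && aCol < wA) ? A[row * wA + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < wA && col < wB) ? B[bRow * wB + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < hA && col < wB)
        C[row * wB + col] = Csub;
}

This would be launched with a grid of ((wB+15)/16, (hA+15)/16) blocks of 16x16 threads, so the grid covers the whole output even when the dimensions are not multiples of the block size.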

Thanks for the response anyway, and I hope this helps anyone who is currently developing with CUDA.

David Lisin

The CUBLAS library provides double precision matrix arithmetic.
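For reference, the legacy (non-thunking) C interface in cublas.h exposes it roughly like this; the comment is mine, the prototype is a sketch from memory:

// Computes C = alpha * op(A) * op(B) + beta * C, column-major storage.
void cublasDgemm(char transa, char transb, int m, int n, int k,
                 double alpha, const double *A, int lda,
                 const double *B, int ldb,
                 double beta, double *C, int ldc);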

N.

Again, thanks for the reply, but here is my current situation:

I am doing several calculations one after the other in CUDA to maximize performance. I'll set up a simple example, just to state the matter:

(MATLAB pseudocode):

for i = 1:100
    result = A[1:end-i] .* B[1:end-i];
    result = result * C[1:end-i];
    result = result * D[i];
    result = 1./result * E[1:end-1];
end

The thing is, I can copy A, B, C, D and E to the device only once! And I can do all the calculations in different kernels, but I repeat, copying to the device only once, using the result of one kernel as the input pointer to the next, which speeds things up a lot. I am using my own kernels for the rest of the functions.

If I use CUBLAS, I need to copy the result of kernel1 back to the CPU, reserve memory with cublasAlloc, execute DGEMM, copy back, reserve memory for the result for CUDA, copy back to the device, and then carry on executing. But the data transfers ruin my speedups… That's why I was wondering if double-precision matrix multiplication works for anyone, since the one offered by the CUDA SDK, modified for double precision and compiled with -arch sm_13, doesn't work.

Thanks again,

David Lisin

No, you don’t.
Cublas can operate on the data resident on the GPU.

Something similar to this (the code is for single precision; you can use double):

! Allocate matrices on GPU
cublasAlloc(m1*m1, size_of_real, devPtrA)
cublasAlloc(m1*m1, size_of_real, devPtrB)
cublasAlloc(m1*m1, size_of_real, devPtrC)
! Copy data from CPU to GPU
cublasSetMatrix(m1, m1, size_of_real, A, m1, devPtrA, m1)
cublasSetMatrix(m1, m1, size_of_real, B, m1, devPtrB, m1)
cublasSetMatrix(m1, m1, size_of_real, C, m1, devPtrC, m1)
! Call SGEMM in CUBLAS library using NON-THUNKING interface (library is expecting data in GPU memory)
call cublasSGEMM('n', 'n', m1, m1, m1, alpha, devPtrA, m1, devPtrB, m1, beta, devPtrC, m1)
! Add all the other CUBLAS calls you need.
! Copy data from GPU to CPU
cublasGetMatrix(m1, m1, size_of_real, devPtrC, m1, C, m1)
! Free memory on device
cublasFree(devPtrA)
cublasFree(devPtrB)
cublasFree(devPtrC)

You don’t. You can leave the results on the device between CUBLAS calls, you can intermingle your own kernels with CUBLAS functions operating on the same data or intermediate results in device memory, and you can use cudaMalloc(), cudaMemcpy() and cudaFree() to manage memory on the device with CUBLAS. CUBLAS pointers are just regular CUDA pointers and can be used interchangeably (remembering that CUBLAS uses Fortran order storage).
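For example, a minimal C-for-CUDA sketch of that pattern: one custom kernel and one DGEMM chained on the same device buffers, with a single copy in and a single copy out. The elementwiseMul kernel and the chainOnDevice wrapper are illustrative names, not library code; it assumes the legacy cublas.h interface and a device compiled for double precision with -arch sm_13:

#include <cuda_runtime.h>
#include <cublas.h>  // legacy (non-thunking) CUBLAS interface

// Illustrative elementwise kernel standing in for "kernel1".
__global__ void elementwiseMul(const double* a, const double* b,
                               double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];
}

// Chain a custom kernel and DGEMM on device-resident data.
// n is the matrix dimension (square here for brevity); hA, hB, hC
// are host arrays of n*n doubles, stored column-major for CUBLAS.
void chainOnDevice(const double* hA, const double* hB, double* hC, int n)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dTmp, *dC;

    cublasInit();
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dTmp, bytes);
    cudaMalloc((void**)&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 1: custom kernel; the result stays on the device.
    int threads = 256;
    int blocks = (n * n + threads - 1) / threads;
    elementwiseMul<<<blocks, threads>>>(dA, dB, dTmp, n * n);

    // Step 2: hand the same device pointer to DGEMM (dC = dTmp * dB).
    cublasDgemm('n', 'n', n, n, n, 1.0, dTmp, n, dB, n, 0.0, dC, n);

    // Only the final result crosses back to the host.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dTmp); cudaFree(dC);
    cublasShutdown();
}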

I had tried this a couple of weeks ago, but didn't get correct results (obviously my fault, because I have tried it today and it works perfectly).

"Cublas can operate on the data resident on the GPU."

I should have looked more into it, because my earlier attempts returned more or less coherent results, but not 100% correct. I was probably calling SGEMM instead of DGEMM.

Again, thanks to everyone for answering!

David Lisin