Multiple calls of one CUDA function - Matlab

I'm working on a MATLAB CUDA MEX function that should initialise and keep two matrices in GPU memory on the first call (from MATLAB: fCUDA(A,C)), and on the second call (from MATLAB: [D]=fCUDA(A,C)) perform a

matrix multiplication of those two matrices using cublasSgemm.

All is going well except that the final result of cublasSgemm comes back empty.

I just want to keep one or two matrices in GPU memory during the first call of the function (using static pointers), and then do a matrix multiplication on the second call.

What am I doing wrong? Did I make a mistake in the usage of some pointer, so that I get no result in the output D matrix? Or is the problem that I keep the right pointer values but am no longer in the same memory address space?

Code:

#include "mex.h"
#include "cuda.h"
#include "cublas.h"
#include <stdio.h>
#include <stdlib.h>

static int initialized = 0;
static float *kazalecA;
static float *kazalecF0;

void cleanup(void) {
    cudaFree(kazalecA);
    cudaFree(kazalecF0);
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    float *kazalecAHost;
    float *kazalecF0Host;

    int vrsticaA  = mxGetM(prhs[0]);
    int stolpecA  = mxGetN(prhs[0]);
    int vrsticaF0 = mxGetM(prhs[1]);
    int stolpecF0 = mxGetN(prhs[1]);

    if (!initialized) {
        cudaMalloc((void **)&kazalecA, sizeof(float)*vrsticaA*stolpecA);
        kazalecAHost = (float *)mxGetPr(prhs[0]);
        cudaMemcpy(kazalecA, kazalecAHost, sizeof(float)*vrsticaA*stolpecA, cudaMemcpyHostToDevice);

        cudaMalloc((void **)&kazalecF0, sizeof(float)*vrsticaF0*stolpecF0);
        kazalecF0Host = (float *)mxGetPr(prhs[0]);
        cudaMemcpy(kazalecF0, kazalecF0Host, sizeof(float)*vrsticaF0*stolpecF0, cudaMemcpyHostToDevice);

        initialized = 1;
        mexAtExit(cleanup);
    } else {
        int dimenzija[2];
        float alpha, beta;
        double *C;
        float *c;
        float *gc;

        alpha = 1.0;
        beta  = 0.0;

        dimenzija[0] = vrsticaA;
        dimenzija[1] = stolpecF0;

        plhs[0] = mxCreateNumericArray(2, dimenzija, mxDOUBLE_CLASS, mxREAL);
        C = mxGetPr(plhs[0]);
        c = (float *)mxMalloc(sizeof(float)*vrsticaA*stolpecF0);

        cublasInit(); // initialise the cuBLAS library
        cublasAlloc(stolpecF0*vrsticaA, sizeof(float), (void **)&gc);
        cublasSetMatrix(vrsticaA, stolpecF0, sizeof(float), c, vrsticaA, (void *)gc, vrsticaA);
        (void) cublasSgemm('n', 'n', vrsticaA, stolpecF0, stolpecA, alpha, kazalecA, vrsticaA, kazalecF0, vrsticaF0, beta, gc, vrsticaA);
        cublasGetMatrix(vrsticaA, stolpecF0, sizeof(float), gc, vrsticaA, c, vrsticaA);

        cublasFree(gc);
        cublasShutdown();
    }
}

Someone wrote me the following regarding a similar function. As I only have one mexFunction, can someone please clarify whether I'm in the same memory address space in the second iteration?

Two things:

  • you are copying double arrays to the GPU, which expects float arrays, so your result will be incorrect

  • you are not converting the floats in array c to doubles in the output array C

For all these problems, check the example code that came with the MATLAB plugin.
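The second point amounts to a simple copy loop before the mexFunction returns. A minimal CPU sketch, assuming c holds the single-precision result from the GPU and C points at the double output buffer from mxGetPr(plhs[0]) (names taken from the code above; the helper function name is hypothetical):

```c
#include <stddef.h>

/* Copy the single-precision GPU result into MATLAB's double output buffer.
   C points at mxGetPr(plhs[0]); n = vrsticaA * stolpecF0. */
static void floats_to_doubles(double *C, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = (double)c[i];
}
```

This would run right after cublasGetMatrix has copied the result back into c.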

The input matrices are generated as A=single(rand(512)); do I still need to convert them to float before sending them to the GPU? As for the output array, I will convert it to doubles as documented in the MATLAB plugin example.

But the big question is if the pointers are correct? Am I in the same address space when I run the function for the second time?

Your input will be okay indeed.

As far as I can see it will work if you actually output the data to MATLAB. But if you really got an empty return, then you are indeed not entering the second branch of the if the second time you run it. You can check by putting a mexPrintf("%d\n", initialized); at the beginning of your mexFunction.

Hi,

Thanks for the help, I'm making good progress with the function now.

Now I have one mexFunction that I’m using for different calls on the GPU (initialization, iteration, clear).

The matrices stay in the memory and multiplication of two matrices is working fine.

So now I have come to the part where I will have to do some more mathematical operations.

To get a clear idea on what I need to do let me first specify some inputs:

ai = 512x1 matrix

at = 1x512 matrix

y = 1x512 matrix

Ai = 512x512 matrix

now I need to compute:

  • ai*at (matrix multiplication, already done with cublasSgemm)

  • 1 + y*ai (matrix multiplication and addition; this will be a scalar, I think)

  • B = (ai*at)/(1 + y*ai) (division of a 512x512 matrix by a scalar)

  • A = Ai - B (a subtraction of two 512x512 matrices)
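Written out on the CPU for reference, the whole update above is only a few lines. A plain C sketch of the formulas (not the GPU version; the function name is hypothetical, and matrices are stored column-major as MATLAB stores them):

```c
#include <stddef.h>

/* CPU reference for: s = 1 + y*ai;  B = (ai*at)/s;  A = Ai - B.
   ai is n x 1, at and y are 1 x n, and Ai, A are n x n, column-major. */
static void update_A(double *A, const double *Ai,
                     const double *ai, const double *at,
                     const double *y, size_t n)
{
    /* scalar s = 1 + y*ai (inner product of a 1 x n row with an n x 1 column) */
    double s = 1.0;
    for (size_t i = 0; i < n; i++)
        s += y[i] * ai[i];

    /* B is the outer product ai*at divided by s; A = Ai - B, fused per element */
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            A[i + j * n] = Ai[i + j * n] - (ai[i] * at[j]) / s;
}
```

Fusing the division and subtraction into one pass avoids materialising B at all, which may also be an option on the GPU.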

So basically I need:

  • matrix division by a scalar

  • subtraction of two matrices

Could you please give me some help on how to implement such a thing in the CUDA or cuBLAS environment? Are there any ready-made functions for matrix division and subtraction (like cublasSgemm for matrix multiplication)? If not, how do I implement them? Or would it be more optimal to do the division and subtraction in the MATLAB environment and only do the multiplication on the GPU?

Also, if I already have a matrix in GPU memory, is there an easy way to transpose it?

There is a transpose example in the SDK, but most BLAS functions seem to have an option to transpose things.
Dividing by a scalar is multiplying by the reciprocal. This is all basic BLAS stuff as far as I can see; no need to program it by hand.

I am not sure you will see so much speedup with 512x512 matrices though.

Now that's a relief (BLAS functions able to transpose). Regarding speedup, it's not that important; I can still increase the size of the matrix later, but first I must make a proof of concept.

EDIT:

So if I understand correctly, to make a matrix multiplication of matrix A and matrix B (where I want matrix B transposed) I need to call cublasSgemm like this:

(void) cublasSgemm('n','t',rA,cD,cA,alpha,pAgpu,rA,pDgpu,rD,beta,gc,rA);

So the transb parameter will transpose the B matrix before the multiplication?

Can multiplying by the reciprocal then be done in cublasSgemm by setting the value of beta?

Could you please give me an example of how to do addition/subtraction of two matrices using BLAS functions?

And many thanks for helping me out. I really appreciate it, as your feedback helps me a lot. :">

I don't know about this in detail; I have never used BLAS. Maybe check some BLAS docs? Otherwise, a kernel that adds two matrices of the same size (or subtracts them) is the simplest thing you can write.

__global__ void sub_kernel(float *A, float *B, float *C, unsigned int size_x, unsigned int size_y)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int idy = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int index = idx + idy * size_x;

    if ((idx < size_x) && (idy < size_y))
        C[index] = A[index] - B[index];
}

and call it like this:

dim3 grid((unsigned int) ceilf((float) size_x/16.0f), (unsigned int) ceilf((float) size_y/16.0f), 1);
dim3 block(16, 16, 1);
sub_kernel<<<grid, block>>>(A, B, C, size_x, size_y);

to get C = A - B;
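As a sanity check, the kernel's arithmetic can be reproduced on the host in plain C. A reference version (hypothetical name sub_ref), using the same linear index idx + idy * size_x over a size_x by size_y grid:

```c
/* CPU equivalent of the element-wise subtraction kernel: C = A - B,
   with the same linear index idx + idy * size_x the GPU threads use. */
static void sub_ref(const float *A, const float *B, float *C,
                    unsigned int size_x, unsigned int size_y)
{
    for (unsigned int idy = 0; idy < size_y; idy++)
        for (unsigned int idx = 0; idx < size_x; idx++) {
            unsigned int index = idx + idy * size_x;
            C[index] = A[index] - B[index];
        }
}
```

Comparing the GPU output against this on a small matrix is a quick way to verify the launch configuration and indexing.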

Common CUDA/MATLAB issues and solutions are posted here:
http://cs.ucf.edu/~janaka/gpu/