Multiple calls of one CUDA function - Matlab

Im working on a Matlab CUDA function that would be able to initialise and keep two matrices in GPU memory in the first run (from matlab fCUDA(A,C)) and in the second run (from matlab [D]=fCUDA(A,C)) there would be a

matrix multiplication of those two functions using cublasGemm.

All is going well except the final result of cublasGemm returns empty function.

I just want to keep one or two matrices in GPU memory in the first run of the function (using static pointers) and in the second run I would like to make a matrix multiplication.

What am I doing wrong? Did I make a mistake in usage of some pointers that I don’t get a result in the output D matrix? Or is there a problem that I’m keeping the right pointer values but that I’m not in the same memory address space anymore?


#include "mex.h"

#include "cuda.h"

#include "cublas.h"

#include <stdio.h>

#include <stdlib.h>

static int initialized = 0;

static float *kazalecA;

static float *kazalecF0;

void cleanup(void) {

 Â cudaFree(kazalecA);

 Â cudaFree(kazalecF0);


void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])


 Â float *kazalecAHost;

 Â float *kazalecF0Host;

 int vrsticaA = mxGetM(prhs[0]);

 Â int stolpecA = mxGetN(prhs[0]);

 Â int vrsticaF0 = mxGetM(prhs[1]);

 Â int stolpecF0 = mxGetN(prhs[1]);

 if (!initialized) {

 Â  Â  Â cudaMalloc ((void **)&kazalecA, sizeof(float)*vrsticaA*stolpecA);

 Â  Â  Â 

 Â  Â  Â kazalecAHost = (float *)mxGetPr(prhs[0]); Â  Â  

 Â  Â  Â cudaMemcpy(kazalecA, kazalecAHost, sizeof(float)*vrsticaA*stolpecA, cudaMemcpyHostToDevice);

  Â  Â cudaMalloc ((void **)&kazalecF0, sizeof(float)*vrsticaF0*stolpecF0);

 Â  Â  Â kazalecF0Host = (float *)mxGetPr(prhs[0]);

 Â  Â  Â cudaMemcpy(kazalecF0, kazalecF0Host, sizeof(float)*vrsticaF0*stolpecF0, cudaMemcpyHostToDevice);

 Â  Â  Â 

initialized = 1;

  Â  Â mexAtExit(cleanup);

 } else {

 Â  Â 

 Â  Â  int dimenzija[2];

 Â  Â  float alpha,beta;

 Â  Â  double *C; 

 Â  Â  float *c;

 Â  Â  float *gc; 

 Â  Â  

 Â  Â  alpha = 1.0; 

 Â  Â  beta = 0.0; Â 


 Â  Â  dimenzija[0]=vrsticaA;

 Â  Â  dimenzija[1]=stolpecF0;

 Â  Â 

 Â  Â  plhs[0] = mxCreateNumericArray(2,dimenzija,mxDOUBLE_CLASS,mxREAL);

 Â  Â 

 Â  Â  C = mxGetPr(plhs[0]);

   Â  c  = (float*) mxMalloc(sizeof(float)*vrsticaA*stolpecF0);

  Â  cublasInit(); //inicalizacija cublas knjiznice

  Â  cublasAlloc (stolpecF0*vrsticaA, sizeof(float), (void**)&gc);

 Â  Â  cublasSetMatrix (vrsticaA, stolpecF0, sizeof(float),c, vrsticaA, (void*)gc, vrsticaA);

  Â  (void) cublasSgemm ('n','n',vrsticaA,stolpecF0,stolpecA,alpha,kazalecA,vrsticaA,kazalecF0,vrsticaF0,beta,gc,vrsticaA);

  Â cublasGetMatrix (vrsticaA, stolpecF0, sizeof(float),  gc, vrsticaA, c, vrsticaA);

 Â  Â  cublasFree(gc);

 Â  Â  cublasShutdown(); Â 

  Â }


Someone wrote me regarding a similar function:

As I only have one mexfunction can someone please clarify if I’m in the same memory address space in the second iteration?

two things:
you are copying double arrays to the GPU which expects float arrays, so you result would be incorrect
you are not converting the floats in array c to doubles in the output array C
For all these problems check the example code that came with the matlab plugin

The input matrices are generated as A=single(rand(512)); do I still need to convert them to float before sending them to GPU? Regarding output array I will convert the output to doubles as is documenter in matlab plugin example.

But the big question is if the pointers are correct? Am I in the same address space when I run the function for the second time?

Your input will be okay indeed.

As far as I can see it will work if you actually output the data to MATLAB. But if you actually got an empty return, then you indeed are not entering the second part of the if, the second time you run it. You can try by putting and mexPrintf("%d\n", initialized); in the beginning of your mexFunction.


Thanks for help I’m making good progress now with the function.

Now I have one mexFunction that I’m using for different calls on the GPU (initialization, iteration, clear).

The matrices stay in the memory and multiplication of two matrices is working fine.

So now I came to the part where I willl have to make some more mathematical operations.

To get a clear idea on what I need to do let me first specify some inputs:

ai = 512x1 matrix

at = 1x512 matrix

y= 1x512 matrix

Ai=512x512 matrix

now I need to compute:

  • a*at - (matrix multiplication - already done - cublasSgemm)

  • 1+ y*ai - (matrix multiplication and addition - this will be a scalar I think)

  • B= (aat)/(1+ ya) - matrix division of 512x512 matrix with some scalar)

  • A=Ai-B (a substraction of two 512x512 matrices)

So basicaly I need:

  • matrix division with scalar

  • substraction of two matrices

Could you please give me some help on how to implement such thing in CUDA or cublas enviroment? Are the any ready made functions matrix division and substraction? (like cublasSgemm for matrix multiplication)? If they are not how to implement them? Or would it be more optimal to make the division and substraction in Matlab environment and only do the multiplication on the GPU?

Also if I already have a matrix in a GPU memory is there an easy way to transpose it?

there is a transpose example in the SDK, but most blas functions seem to have an option to transpose things.
dividing by a scalar is multiplying with the reciprocal. This is all basic blas stuff as far as I can see, no need to program stuff by hand.

I am not sure you will see so much speedup with 512x512 matrices though.

Now that’s a relief (blas functions able to transpose). Regarding speedup it’s not that important, I can still increase the size of the matrix later, but first I must make a proof of concept.


So if I understand corectly to make a matrix multiplication of Matrix A and Matrix B (and I want transposed matrix B ) I need to call cublasSgemm like this:

(void) cublasSgemm (‘n’,‘t’,rA,cD,cA,alpha,pAgpu,rA,pDgpu,rD,beta,gc,rA);

So that transb parameter will do the transpose of the B matrix before multiplication?

Multiplying with reciprocal then can be done in cublasSgemm with setting the value for beta?

Could you please give me an example on how to do addition/substraction of two matrices using blas functions?

And many thanks for helping me out I really apreciate it as your feedback help’s me a lot. :">

don’t know about this in detail, I have never used BLAS. Maybe check some BLAS docs? Otherwise, a kernel that adds 2 matrices of the same size (or substracts them) is the simplest you can do.

__global__ void sub_kernel(float *A, float * B, float *C, unsigned int size_x, unsigned int size_y)


unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

unsigned int idy = threadIdx.y + blockIdx.y * blockDim.y;

unsigned int index = idx + idy * size_x;

if ((idx < size_x) && (idy < size_y))

C[index] = A[index] - B[index];


and call it like this :

dim3 grid((unsigned int) ceilf((float) size_x/16.0f), (unsigned int) ceilf((float) size_y/16.0f), 1);

dim3 block(16,16,1);

add_kernel<<<grid, block>>>(A, B, C, size_x, size_y);

to get C = A-B;

common CUDA MATLAB issues and solutions are posted here