How can I call cublasSgemmBatched on pointer device arrays without allocating them twice?

I’m trying to figure out how to use cublasSgemmBatched. So far I have managed to create a working version that takes an (m x n x k) matrix A from the host, multiplies it with an (n x q x k) matrix B from the host, and returns the resulting (m x q x k) matrix C to the host. This is my code (heavily inspired by an example I found):

#include "mex.h"
#include "cublas_v2.h"

// The MEX gateway function.
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {

// Get input variables from Matlab (host variables).
const float *A, *B;
A = (float*)mxGetPr(prhs[0]);
B = (float*)mxGetPr(prhs[1]);

// Get dimensions of input variables from Matlab.
size_t m, n, k, q;
const mwSize *Adims, *Bdims;
Adims = mxGetDimensions(prhs[0]);
Bdims = mxGetDimensions(prhs[1]);
m = Adims[0];
n = Adims[1];
k = Adims[2];
q = Bdims[1];

// Create host version of resulting C matrix.
float *C;
mwSize dims[3] = { m, q, k };
plhs[0] = mxCreateNumericArray(3, dims, mxSINGLE_CLASS, mxREAL);
C = (float*)mxGetPr(plhs[0]);

// Host arrays of device pointers for A, B and C (one pointer per page).
float **devA = (float **)malloc(k * sizeof(float*));
float **devB = (float **)malloc(k * sizeof(float*));
float **devC = (float **)malloc(k * sizeof(float*));

// size of each A, B and C page.
size_t A_size = sizeof(float) * m * n; 
size_t B_size = sizeof(float) * n * q;
size_t C_size = sizeof(float) * m * q;

// Allocate device space page by page.
for (int i = 0; i < k; i++) {
    cudaMalloc((void**)&devA[i], A_size);
    cudaMalloc((void**)&devB[i], B_size);
    cudaMalloc((void**)&devC[i], C_size);
}

// Create cublas handle.
cublasHandle_t h;
cublasCreate(&h);

// Create arrays of pointers that cublasSgemmBatched will use.
const float **d_Aarray, **d_Barray;
float **d_Carray;
cudaMalloc((void**)&d_Aarray, k * sizeof(float *));
cudaMalloc((void**)&d_Barray, k * sizeof(float *));
cudaMalloc((void**)&d_Carray, k * sizeof(float *));

// Copy A, B and C from host to device, page by page.
for (int i = 0; i < k; i++) {
    cudaMemcpy(devA[i], A + m * n * i, A_size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB[i], B + n * q * i, B_size, cudaMemcpyHostToDevice);
    cudaMemcpy(devC[i], C + m * q * i, C_size, cudaMemcpyHostToDevice);
}

// Pointers pointing to locations in the real matrices are copied into pointer arrays.
cudaMemcpy(d_Aarray, devA, k * sizeof(float *), cudaMemcpyHostToDevice);
cudaMemcpy(d_Barray, devB, k * sizeof(float *), cudaMemcpyHostToDevice);
cudaMemcpy(d_Carray, devC, k * sizeof(float *), cudaMemcpyHostToDevice);

// Run cublasSgemmBatched.
const float alpha = 1.0;
const float beta = 0.0;

cublasSgemmBatched(h,            // cublas handle.
                   CUBLAS_OP_N,  // Don't transpose A.
                   CUBLAS_OP_N,  // Don't transpose B.
                   m,            // Rows of A.
                   q,            // Columns of B.
                   n,            // Columns of A (after potential transpose).
                   &alpha,       // Scalar alpha.
                   d_Aarray,     // Pointer array to A pages.
                   m,            // ldA.
                   d_Barray,     // Pointer array to B pages.
                   n,            // ldB.
                   &beta,        // Scalar beta.
                   d_Carray,     // Pointer array to C pages.
                   m,            // ldC.
                   k);           // Number of pages to batch.

// Copy result back, free memory & destroy cublas handle.
for (int i = 0; i < k; i++) {
    cudaMemcpy(C + m * q * i, devC[i], C_size, cudaMemcpyDeviceToHost);
    cudaFree(devA[i]);
    cudaFree(devB[i]);
    cudaFree(devC[i]);
}
free(devA); free(devB); free(devC);
cudaFree(d_Aarray); cudaFree(d_Barray); cudaFree(d_Carray);
cublasDestroy(h);
}

However, I would like to use cublasSgemmBatched in a calculation sequence where the matrices A and B are already on the device to begin with, not on the host as is the case in my current implementation.

Say that matrices A and B for example were inserted at the very start of my function like this:

float *devA, *devB, *devC;
cudaMalloc(&devA, m * n * k * sizeof(float));
cudaMalloc(&devB, n * q * k * sizeof(float));
cudaMalloc(&devC, n * q * k * sizeof(float));
cudaMemcpy(devA, A, m * n * k * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(devB, B, n * q * k * sizeof(float), cudaMemcpyHostToDevice);

How would I then go about creating the pointer arrays d_Aarray, d_Barray, d_Carray that cublas wants without having to make any additional copies of devA, devB, devC? In other words, what would I have to change in the code after line 44 to make it work with devA, devB and devC if they are given as shown above?

The solution is probably trivial to people with a deep understanding of double pointers and arrays of pointers, but since my understanding of these things is currently shallow, it's far from obvious to me how I should create these pointer arrays.

This should be just pointer arithmetic.

In your first code snippet, substitute the following as a replacement for lines 56-66 (your code in your second snippet is already replacing lines 56-61, effectively):

// allocate host storage for pointer arrays
float **ptrA, **ptrB, **ptrC;
ptrA = (float **)malloc(k*sizeof(float*));
ptrB = (float **)malloc(k*sizeof(float*));
ptrC = (float **)malloc(k*sizeof(float*));

// pointer arithmetic to offset each pointer to the next matrix
for (int i = 0; i < k; i++) {
  ptrA[i] = devA+(i*m*n);
  ptrB[i] = devB+(i*n*q);
  ptrC[i] = devC+(i*m*q);
}

// Pointers pointing to locations in the real matrices are copied into pointer arrays.
cudaMemcpy(d_Aarray, ptrA, k * sizeof(float *), cudaMemcpyHostToDevice);
cudaMemcpy(d_Barray, ptrB, k * sizeof(float *), cudaMemcpyHostToDevice);
cudaMemcpy(d_Carray, ptrC, k * sizeof(float *), cudaMemcpyHostToDevice);

By the way, this line of code in your second snippet looks wrong:

cudaMalloc(&devC, n * q * k * sizeof(float));

shouldn’t it be:

cudaMalloc(&devC, m * q * k * sizeof(float));

Fantastic, seems to work fine! Thank you Bob.

And yes, you are of course correct that devC should be m * q * k.

Just one small follow-up question: when you create ptrA, ptrB and ptrC like this:

ptrA = (float **)(malloc(k*sizeof(float*));

They are intended as pointers in host memory pointing to locations in device memory, if I understand this correctly. So should they be deallocated with free() or with cudaFree() at the end of my program? I would assume free(), but I have seen examples of similar things where host memory is freed using cudaFree, which I find confusing. My program compiles and runs using both alternatives, but I'm concerned about memory leaks etc. if I don't do it the right way.

If you allocate with malloc, you free with free.

If you allocate with cudaMalloc (or cudaMallocManaged, or cudaHostAlloc, or cudaMallocHost) you free with cudaFree.

So you should use free for those particular pointers (ptrA, ptrB, ptrC).

If you see an example where host memory (allocated e.g. with malloc or new in host code) is freed with cudaFree, that is a bug. Proper CUDA error checking on that cudaFree call would highlight it (even if you are doing it wrong in your own code).

If, by “host memory”, you mean memory allocated with cudaMallocManaged, then that should indeed be freed with cudaFree, even if you like to call such an allocation “host memory” (I would not call it that, however).

Since you mention that it seems to work either way in your own code, this is a good time for me to reinforce a basic concept:

  1. If you’re a novice CUDA programmer, you should always be using proper CUDA error checking.
  2. Any time you’re having trouble with a CUDA code, you should always be using proper CUDA error checking, before asking others for help.

In your own program, check the CUDA error return from that cudaFree call that you wrote, that is attempting to free ptrA or ptrB or ptrC. It will return an error. Then you would be in less of a quandary about “what is right”. You would know, at least, that using cudaFree that way is wrong, because the error checking would tell you that.

Use proper CUDA error checking. Don’t be “that guy” who doesn’t, and then asks others for help “what is wrong with my code?”