cublasSgemm on Submatrices iteratively

Hey all,

I have a simple question, but I seem to be struggling with it since I'm a bit new to CUDA and cuBLAS.

I would like to perform a simple matrix multiplication with cublasSgemm, computing C = -2*A*B'.

Depending on the size, I might need to split the work sent to the card and perform it block by block, i.e. process the first half of the rows of matrix A, then the second half, to fill in my C matrix.

While the result is fine when I use the full matrices, I run into problems when I split them the way I described.

Is that because the matrices are stored in column-major order? I'd guess it shouldn't matter, but I'm asking just in case…

My other guess is that the leading dimension should change, but I'm not sure…

Here is the code for exactly that part.

The shapes are q_dev(m, k), r_dev(n, k), and d_dev(m, n):

for (int i = 0; i < size; i += query_size) {
    result = cudaMemcpy(q_dev, &q_host[i], query_size * size_of_float * k, cudaMemcpyHostToDevice);

    cublasOperation_t transa = CUBLAS_OP_N;
    cublasOperation_t transb = CUBLAS_OP_T;

    const float alpha = -2.0f;
    const float beta  = 0.0f;

    status = cublasSgemm(handle, transa, transb, (int)m, (int)n, k, &alpha, q_dev, m, r_dev, n, &beta, d_dev, m);

    result = cudaMemcpy(&d_host[i], d_dev, query_size * size_of_float * n, cudaMemcpyDeviceToHost);
}


In this example, whenever query_size = m everything works fine.

If I use an arbitrary query_size, I don't get the correct results.

Any ideas?

Thank you in advance.