Hey all,
I have a simple question, but I'm struggling a bit since I'm fairly new to CUDA and cuBLAS.
I would like to perform a simple matrix multiplication with cublasSgemm: C = -2·A·B'.
Depending on the size, I may need to split the work into chunks, i.e. send the first half of the rows of A, then the second half, filling in C block by block.
The result is correct when I use the full matrices, but when I split them as described I get wrong results.
Is that because the matrices are stored in column-major order? I would guess it shouldn't matter, but I'm asking just in case.
My other guess is that the leading dimension should change, but I'm not sure.
Here is the relevant part of the code. q_dev is m×k, r_dev is n×k, and d_dev is m×n:
for (int i = 0; i < size; i += query_size)
{
    // copy the next block of rows of A to the device
    result = cudaMemcpy(q_dev, &q_host[i],
                        query_size * size_of_float * k,
                        cudaMemcpyHostToDevice);

    cublasOperation_t transa = CUBLAS_OP_N;
    cublasOperation_t transb = CUBLAS_OP_T;
    const float alpha = -2.0f;
    const float beta  = 0.0f;

    // C = alpha * A * B' + beta * C
    status = cublasSgemm(handle, transa, transb,
                         (int)m, (int)n, k,
                         &alpha, q_dev, m,
                         r_dev, n,
                         &beta, d_dev, m);

    // copy the corresponding block of C back to the host
    result = cudaMemcpy(&d_host[i], d_dev,
                        query_size * size_of_float * n,
                        cudaMemcpyDeviceToHost);
}
In this example, whenever query_size == m everything works fine.
If I use an arbitrary query_size, I don't get correct results.
Any ideas?
Thank you in advance.