I have a 10x10 matrix A and a 10x1 matrix D, and I want to compute AD. I compute AD in MATLAB and in an existing CPU version of the code in the software I'm writing, and they give identical results. I'm not writing any kernels yet, so I was trying to just use CUBLAS for everything, and I have everything working except this function.
I have this call in my source:
if(!GPUPLA.gemv(D, A, TEMP, 1)) { algo_->error("GEMV error."); return 0; }
This is just to abstract the actual CUDA calls into another file. The function called is here:
bool
GPULinearAlgebra::gemv(GPUVector& b, GPUMatrix& a, GPUVector& r, double n)
{
    cublasDgemv('N', a.rows_, a.columns_, n, a.data_, a.rows_, b.data_, 1, 1, r.data_, 1);
    if(!check_error(cublasGetError())) { printf("GEMV Failed\n"); return false; }
    return true;
}
The TEMP vector is 10x1 and already allocated and filled with zeros. From my reading of the CUBLAS library PDF, I thought this should work.
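As I read the PDF, Dgemv computes y = alpha*op(A)*x + beta*y, so my call should give r = n*A*b + 1*r, and since TEMP starts out as all zeros the beta = 1 term shouldn't matter.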
As it stands, I'm getting the wrong answers. I've verified that D and A have the correct values before the call by pulling them off the GPU and printing them.
On the basis of what you posted it is hard to say. Presuming you have the memory management and copying side of things correct (you haven't shown any of that code, so it is impossible to say one way or the other), the obvious place people often go wrong with CUBLAS is passing row-major ordered arrays. CUBLAS is a FORTRAN-ordered BLAS, not a C-ordered one, so input matrices need to be stored in column-major order.
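To make the difference concrete: a 2x3 matrix M = [1 2 3; 4 5 6] is stored as {1, 2, 3, 4, 5, 6} in row major but as {1, 4, 2, 5, 3, 6} in column major. A host-side repacking loop like the following, run before you upload, is all the conversion amounts to (pack_col_major is just an illustrative name, not part of CUBLAS):

/* Repack a row-major matrix into a column-major buffer. */
void pack_col_major(const double *row_maj, double *col_maj, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            col_maj[j * rows + i] = row_maj[i * cols + j];  /* element (i,j) */
}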
I had read about that and thought I was OK, but now that I think about it, you are probably right. I assumed I was safe since my dot products were working fine. Now that I look at what I print out, it is in row-major order. I probably didn't notice before because the other data sets I was using had symmetric A matrices.
This is unfortunate, because these matrices are imported in a different area of the program that I am not touching, so I would lose some speed by having to convert the matrices.
If I use the transpose option in gemv to make it operate on A transpose, that would essentially be the same as going from row major to column major for this simple example, correct? Unless transposing before I store the matrices would be faster; I don't know what the performance cost of gemv's transpose path is.
Unfortunately, not necessarily. Depending on how you are allocating the GPU memory, padding/alignment rows can be added to the storage for memory-access performance reasons. So in theory you are correct: you could just use the transpose and it should work. But be very sure of the memory layout of your matrices on the GPU before you do so.
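If the device buffer really is the unpadded row-major data, then viewed column-major it is A transpose (a columns_ x rows_ matrix), and asking gemv for the transpose of that gives back A. So inside your wrapper, something along these lines should give r = n*A*b; treat it as a sketch against your field names, not tested code:

cublasDgemv('T',        /* transpose the stored (column-major) matrix      */
            a.columns_, /* m: rows of the stored matrix = columns of A     */
            a.rows_,    /* n: cols of the stored matrix = rows of A        */
            n,          /* alpha                                           */
            a.data_,
            a.columns_, /* lda: must match the real allocated row count,
                           i.e. no padding rows                            */
            b.data_, 1,
            0.0,        /* beta = 0 so stale values in r don't leak in     */
            r.data_, 1);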
I'm just using the basic CUBLAS add-vector and add-matrix functions. I'll try to verify, though. I'm pretty positive the values are in row-major order in host memory before I send them to the device, and when I pull them out. I'm guessing that on the device they aren't actually laid out the way I thought they were.
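For the verification, something along these lines should settle it (check_layout and dev_a are placeholder names, not part of my wrappers):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Pull the raw device buffer back and look at one off-diagonal element.
 * For a 10x10 A, buffer[1] is A(0,1) if the storage is row major, but
 * A(1,0) if it is column major. Compare against the MATLAB values. */
void check_layout(const double *dev_a, int rows, int cols)
{
    double *h = (double *)malloc(rows * cols * sizeof(double));
    cudaMemcpy(h, dev_a, rows * cols * sizeof(double), cudaMemcpyDeviceToHost);
    printf("buffer[1] = %g (row major -> A(0,1), column major -> A(1,0))\n", h[1]);
    free(h);
}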