cublas: Using dgemv

I am new to CUDA and to cublas. I have a question:

I simply want to perform a matrix-vector multiply with a general double-precision matrix and vector. (My GPU is compute capability 1.3, so it supports double precision.)

I noticed there is no function simply for a matrix-vector multiply. The nearest match is dgemv, which is: r = alpha * A * x + beta * y.

Obviously, I can simply set alpha = 1.0 and beta = 0.0 to get the same behavior. But this leads me to a twofold question:

(1) Is the library smart enough to know that since alpha = 1 and beta = 0 it should not perform all of those extra multiplies and adds? (It seems like using C++ templates could help out here…)

(2) Do I need to allocate space on the GPU for a “dummy” variable y, or can I simply pass NULL for y without causing major issues?

(Or perhaps there is a different function I should be using instead?)

I think it is, but for reasonably sized problems it probably doesn’t make a great deal of difference. The operation count of gemv is notionally 2MN; the alpha and beta scaling only adds another 2M. You would expect 2MN >> 2M for anything other than trivially small cases, so the overall effect on computation time is probably not all that large.

Passing NULL probably won’t work, but passing the vector twice with beta = 0 should be safe.

It turns out I had read the documentation wrong the first time, oops…hadn’t noticed it is actually: y = alpha * A * x + beta * y (not r = alpha * A * x + beta * y).

In my particular application, M >> N so the O(MN) operation is approximately O(M) with M large (several thousand) but N small (less than 10…generally 2 or 3); adding more O(M) work would significantly impact runtime, so if someone could answer #1 definitively I would greatly appreciate it.
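To put rough numbers on this, here is a small C sketch using the notional counts quoted earlier in the thread (2MN for the product, 2M for the alpha/beta scaling). The function names are made up for illustration, and these are flop estimates only; they say nothing about memory traffic or kernel launch cost, which may well dominate on a GPU.

```c
/* Flop estimates following the counts quoted in the thread. */
double gemv_flops(double m, double n)
{
    return 2.0 * m * n; /* one multiply and one add per matrix element */
}

double scale_flops(double m)
{
    return 2.0 * m; /* notional cost of the alpha/beta scaling */
}

/* Scaling work as a fraction of the product work; simplifies to 1/n. */
double scale_overhead(double m, double n)
{
    return scale_flops(m) / gemv_flops(m, n);
}
```

The ratio is 1/N regardless of M, so for N = 3 the scaling would add about a third to the flop count, and for N = 2 about half. That is why the extra O(M) work matters here even though it is negligible for large N.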