Adding two vectors using CUBLAS

HyoukJoong_Lee · July 11, 2010, 8:51pm

Hi all,

This might be more of a general BLAS API question.
If I want to add two vectors and store the result in a third vector (C[i] = A[i] + B[i]),
how can I use the CUBLAS API to do this?
It seems that there isn’t a single API call that performs this operation.
What I think I could do is use two BLAS calls: one to copy A[i] to C[i] and another to compute C[i] = C[i] + B[i].
Is this the only way to do this calculation using CUBLAS?
The kernel itself is extremely simple, but I’d like to know how other people do this using the BLAS API.
Thanks.

HyoukJoong.

LSChien · July 11, 2010, 9:42pm

cublasSaxpy (int n, float alpha, const float *x, int incx, float *y, int incy)

Set alpha = 1. This call overwrites y, so you need to:

(1) copy B to y
(2) compute y = alpha*A + y
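
The two steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes the handle-based cuBLAS v2 API (`cublas_v2.h`) rather than the older signature quoted in this post, it assumes all three pointers already refer to device memory, and the function name `add_with_cublas` is made up for illustration.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: computes d_c = d_a + d_b with two cuBLAS calls.
// Assumes d_a, d_b and d_c are distinct device buffers of n floats each.
cublasStatus_t add_with_cublas(cublasHandle_t handle, int n,
                               const float *d_a, const float *d_b, float *d_c)
{
    const float alpha = 1.0f;  // host scalar (default host pointer mode)

    // Step (1): copy B into the output vector, c = b.
    cublasStatus_t status = cublasScopy(handle, n, d_b, 1, d_c, 1);
    if (status != CUBLAS_STATUS_SUCCESS) {
        return status;
    }

    // Step (2): c = alpha * a + c, which with alpha = 1 gives c = a + b.
    return cublasSaxpy(handle, n, &alpha, d_a, 1, d_c, 1);
}
```

The caller is expected to create the handle with `cublasCreate` and release it with `cublasDestroy`; the inputs A and B are left unchanged.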

HyoukJoong_Lee · July 11, 2010, 9:58pm

Thanks for your reply.

But what you describe still takes two BLAS calls, as I mentioned.

I think this is worse for performance than a single handwritten kernel call for C = A + B, because it would transfer the same data (B and y) twice, thus using more memory bandwidth.

I was wondering if there is a single BLAS call I could use to do the calculation.

Is this the only solution for the operation?

Thanks.

HyoukJoong.
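
A rough count of global-memory traffic illustrates the bandwidth concern. This is only an estimate under stated assumptions: vectors of n elements, every element access going to device memory, and no cache reuse between the two BLAS calls. Counts are in element accesses; multiply by 4 for bytes in single precision.

```latex
\begin{aligned}
T_{\text{copy}}  &= n \;(\text{read } B) + n \;(\text{write } y) = 2n \\
T_{\text{axpy}}  &= n \;(\text{read } A) + n \;(\text{read } y) + n \;(\text{write } y) = 3n \\
T_{\text{two calls}} &= T_{\text{copy}} + T_{\text{axpy}} = 5n \\
T_{\text{fused}} &= n \;(\text{read } A) + n \;(\text{read } B) + n \;(\text{write } C) = 3n \\
\frac{T_{\text{two calls}}}{T_{\text{fused}}} &= \frac{5}{3} \approx 1.67
\end{aligned}
```

Under these assumptions the two-call approach moves about 1.67 times as much data as a single fused kernel, and for a bandwidth-bound operation like this the run time would be expected to scale roughly the same way.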

LSChien · July 12, 2010, 4:11am

As far as I know, all BLAS routines overwrite a right-hand-side vector/matrix.

You have two choices:

(1) write a kernel yourself
(2) use a (commercial) compiler to translate sequential code into CUDA code, so that you don’t need to write CUDA code yourself
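
On the single-call question: standard BLAS level-1 routines do work in place, but newer cuBLAS releases include a BLAS-like extension, `cublasSgeam`, that computes C = alpha*op(A) + beta*op(B) into a separate output. The sketch below is a hedged illustration, assuming a cuBLAS version that ships this extension and the v2 handle-based API; it treats each length-n vector as an n-by-1 column-major matrix. The function name `add_with_geam` is made up for illustration.

```cuda
#include <cublas_v2.h>

// Hypothetical helper: computes d_c = d_a + d_b with one cublasSgeam call.
// Each vector is viewed as an n-by-1 column-major matrix with leading
// dimension n. Assumes n >= 1 and that all pointers are device pointers
// to distinct buffers of n floats.
cublasStatus_t add_with_geam(cublasHandle_t handle, int n,
                             const float *d_a, const float *d_b, float *d_c)
{
    const float alpha = 1.0f;  // scale factor for A
    const float beta = 1.0f;   // scale factor for B

    return cublasSgeam(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,  // no transposition
                       n, 1,                      // rows, columns
                       &alpha, d_a, n,
                       &beta, d_b, n,
                       d_c, n);
}
```

Because the output is a separate buffer, A and B are read once each and C is written once, which matches the memory traffic of a handwritten kernel.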

MMB · July 12, 2010, 2:11pm

If A and B are already on the device, then using a kernel to accomplish what you want could be beneficial. This kernel would be very simple to write. However, if you have to transfer A and B to the device and then retrieve C, the transfer time would dominate, as there is very little for the device to do (one addition per element). In the latter case this is best left for the CPU to do.

MMB
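
For reference, the simple kernel discussed in this thread might look like the sketch below. It assumes A, B and C already live in device memory, as in the first case MMB describes. The names `vector_add` and `launch_vector_add` and the block size of 256 are illustrative choices, not anything prescribed by CUDA or CUBLAS.

```cuda
#include <cuda_runtime.h>

// One thread per element: c[i] = a[i] + b[i].
__global__ void vector_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard the last, partially filled block
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host-side helper: launches the kernel on device pointers
// and reports any launch error. Does nothing when n <= 0.
cudaError_t launch_vector_add(const float *d_a, const float *d_b,
                              float *d_c, int n)
{
    if (n <= 0) {
        return cudaSuccess;
    }
    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // round up
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    return cudaGetLastError();
}
```

Each element of A and B is read once and each element of C is written once, so no intermediate copy is needed.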