Element-wise vector multiplication

Hello :)

With all the recent updates, I was wondering whether there is a better way to compute an element-wise vector multiplication?

From my point of view, my function is not fast enough :/

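attributes(global) subroutine SUB_vvm(v1,v2,n)
  implicit none
  integer,device :: n                  ! n resides in device memory
  real(fp),device :: v1(0:n),v2(0:n)   ! fp: real kind defined elsewhere
  integer :: i                         ! thread-local index (VALUE applies only to dummy arguments)
  ! 1-based global thread index, so element 0 is left untouched
  i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
  IF (i <= n) v1(i) = v1(i) * v2(i)
end subroutine SUB_vvm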

Thank you in advance for your support,

Hi Remy,

There’s not much more that can be done here, given that the code is memory bound: it will only run as fast as the GPU’s memory bandwidth allows.
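
As a rough back-of-the-envelope bound (a sketch only; the symbols $s$, $B$ and $t_{\min}$ are introduced here for illustration): each update `v1(i) = v1(i) * v2(i)` reads two values and writes one, for a single multiply. With $s$ bytes per real and a sustained memory bandwidth $B$, the kernel time for $n$ elements cannot drop much below

```latex
t_{\min} \approx \frac{3\,n\,s}{B}
```

For a purely hypothetical case of $n = 10^8$ single-precision elements ($s = 4$) and $B = 500$ GB/s, that gives about $1.2 \times 10^9 / 5 \times 10^{11} \approx 2.4$ ms, no matter how the multiply itself is coded.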

Note that you’re not updating index “0”, since your “i” starts at 1.


Hi Mat,

Thank you for your answer.

Well, I coded my own ‘axpy’ function, and the cuBLAS version turned out to be faster.
So I was just hoping there was a cuBLAS function for element-wise vector multiplication.

Concerning the index ‘0’, yes I know, it is on purpose.

Thank you again,
Have a great day,

I can’t see how a cuBLAS saxpy could be much faster than a simple hand-coded loop. Was it much faster, or just a little bit faster? Also, in the code above, is N really on the device, or do you pass it into the kernel by value?
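
To make the by-value question concrete, here is a minimal sketch of the same kernel with `n` passed by value, plus a small timing harness that reports the effective bandwidth. The names `vvm_mod`, `vvm_byval` and `time_vvm`, the single-precision `fp`, the problem size and the launch configuration are all assumptions for illustration, not taken from the original code.

```fortran
module vvm_mod
  use cudafor
  implicit none
  integer, parameter :: fp = kind(1.0)      ! assumption: single precision
contains
  ! v1 = v1 * v2 for indices 1..n (index 0 is skipped on purpose, as in the thread)
  attributes(global) subroutine vvm_byval(v1, v2, n)
    integer, value :: n                     ! scalar passed by value from the host
    real(fp), device :: v1(0:n), v2(0:n)
    integer :: i
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) v1(i) = v1(i) * v2(i)
  end subroutine vvm_byval
end module vvm_mod

program time_vvm
  use cudafor
  use vvm_mod
  implicit none
  ! Hypothetical problem size and launch configuration
  integer, parameter :: n = 2**24, nthreads = 256, nreps = 100
  real(fp), device, allocatable :: v1_d(:), v2_d(:)
  real(fp), allocatable :: v1(:)
  type(cudaEvent) :: t0, t1
  real :: ms, gbps
  integer :: istat, k, nblocks

  allocate(v1_d(0:n), v2_d(0:n), v1(0:n))
  v1_d = 2.0_fp
  v2_d = 1.0_fp                             ! multiplying by 1 keeps values stable over repeats
  nblocks = (n + nthreads - 1) / nthreads

  istat = cudaEventCreate(t0)
  istat = cudaEventCreate(t1)

  call vvm_byval<<<nblocks, nthreads>>>(v1_d, v2_d, n)   ! warm-up launch
  istat = cudaEventRecord(t0, 0)
  do k = 1, nreps
    call vvm_byval<<<nblocks, nthreads>>>(v1_d, v2_d, n)
  end do
  istat = cudaEventRecord(t1, 0)
  istat = cudaEventSynchronize(t1)
  istat = cudaEventElapsedTime(ms, t0, t1)  ! elapsed time in milliseconds

  ! Two loads and one store of real(fp) per element and per launch
  gbps = 3.0 * real(n) * real(storage_size(1.0_fp) / 8) * real(nreps) / (ms * 1.0e6)
  print *, 'average time per launch (ms):', ms / nreps
  print *, 'effective bandwidth (GB/s):  ', gbps

  v1 = v1_d                                 ! copy back and check the result
  print *, 'max error:', maxval(abs(v1(1:n) - 2.0_fp))
end program time_vvm
```

This should build with something like `nvfortran -cuda time_vvm.cuf`. If the reported bandwidth is already close to what the GPU can sustain, a library routine is unlikely to do noticeably better.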

I think in cuBLAS, your only option might be sgemv with N = 1. cuTENSOR has element-wise operations, but the setup cost might be quite a bit higher.
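
One other cuBLAS routine that may be worth a look (it is not the sgemv route mentioned above) is `cublasSdgmm`, which computes C = diag(x) * A. With A treated as an n-by-1 matrix, that is exactly an element-wise product. The sketch below assumes that the NVHPC `cublas_v2` module exposes `cublasSdgmm` with the same argument order as the C API; the program and variable names are made up for illustration.

```fortran
program vvm_dgmm
  use cudafor
  use cublas_v2
  implicit none
  integer, parameter :: n = 8               ! hypothetical small size for a quick check
  real(4), device :: a_d(n,1), x_d(n), c_d(n,1)
  real(4) :: a(n,1), x(n), c(n,1)
  type(cublasHandle) :: h
  integer :: istat, i

  do i = 1, n
    a(i,1) = real(i)
    x(i)   = 2.0
  end do
  a_d = a
  x_d = x

  istat = cublasCreate(h)
  ! C = diag(x) * A with A an n-by-1 matrix, i.e. c(i) = x(i) * a(i)
  ! (assumed interface: handle, side mode, m, n, A, lda, x, incx, C, ldc)
  istat = cublasSdgmm(h, CUBLAS_SIDE_LEFT, n, 1, a_d, n, x_d, 1, c_d, n)
  if (istat /= CUBLAS_STATUS_SUCCESS) print *, 'cublasSdgmm failed, status =', istat
  istat = cublasDestroy(h)

  c = c_d
  print *, c(:,1)
end program vvm_dgmm
```

Building would need the cuBLAS library linked in, for example `nvfortran -cuda -cudalib=cublas vvm_dgmm.cuf`. Since the operation is still memory bound, I would not expect this to beat the hand-written kernel by much.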