CUBLAS Vector Multiply

jmready · February 28, 2008, 9:09pm

Does CUBLAS have any function that could be reasonably used to implement element by element multiplication between two vectors?

jeronimoh · February 29, 2008, 8:41am

Good question, man… I was trying to solve the same problem over the last few days. I was wondering whether not only CUBLAS, but any other implementation of BLAS, has element-wise vector/vector multiplication implemented. To be honest, I haven’t been able to find a definitive answer yet. But one of my colleagues suggested that I look at the BLAS level 2 routines, which implement various types of Ax (matrix-vector) operations. That’s because element-wise vector multiplication is nothing more than A*x for a diagonal matrix A. I believe this could help you…
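One way to make the diagonal-matrix idea concrete is the symmetric banded matrix-vector routine SBMV. With zero super-diagonals (k = 0) the band storage holds only the diagonal, so a plain vector can be passed as the "matrix". The following is a minimal sketch, assuming the handle-based cuBLAS API (`cublas_v2.h`), which is newer than this thread; the helper name `elementwise_mul_sbmv` is hypothetical and error handling is reduced to a boolean.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: y[i] = a[i] * x[i], computed as y = diag(a) * x.
// SBMV with k = 0 and lda = 1 treats the vector a as the band storage
// of a diagonal matrix. Returns true on success.
bool elementwise_mul_sbmv(const std::vector<float>& a,
                          const std::vector<float>& x,
                          std::vector<float>& y)
{
    if (a.size() != x.size()) return false;
    const int n = static_cast<int>(a.size());
    y.assign(n, 0.0f);
    if (n == 0) return true;

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_x = nullptr, *d_y = nullptr;
    cublasHandle_t handle = nullptr;

    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_x, bytes) == cudaSuccess &&
              cudaMalloc(&d_y, bytes) == cudaSuccess &&
              cudaMemset(d_y, 0, bytes) == cudaSuccess &&
              cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS &&
              cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const float alpha = 1.0f;  // y = alpha * A * x + beta * y
        const float beta  = 0.0f;
        ok = cublasSsbmv(handle, CUBLAS_FILL_MODE_LOWER, n, /*k=*/0,
                         &alpha, d_a, /*lda=*/1, d_x, 1,
                         &beta, d_y, 1) == CUBLAS_STATUS_SUCCESS &&
             cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    if (handle) cublasDestroy(handle);
    cudaFree(d_a);
    cudaFree(d_x);
    cudaFree(d_y);
    return ok;
}
```

This keeps the whole computation inside BLAS calls, which is the stated goal, but the sketch says nothing about whether it is faster than a hand-written kernel; that would need to be measured.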

seibert · February 29, 2008, 1:30pm

Certainly this is a trivial custom kernel to write. It might be easier to figure out the memory layout for vectors in CUBLAS and use your own kernel.

jeronimoh · February 29, 2008, 2:03pm

I have one note and one question related to your comment ;-). Element-wise multiplication could of course be implemented with a very trivial user-defined kernel. But for iterative techniques built on BLAS (as in my case), there is a well-founded wish to use BLAS operations for all successive steps (SAXPY, SGEMV, SDOT, …). That is why I’m fairly confident that element-wise vector*vector should be implemented somehow in BLAS.

My question follows:

I’m using cuMemAlloc() (driver API) instead of cublasAlloc() and cuMemcpyHtoD() instead of cublasSetVector() in my application, because I need to use both CUBLAS routines and user-defined kernels on my data (large vectors, in fact). From what I have observed so far, no problems have occurred and all operations work fine. Do you think I’m just a lucky man, and that I should strictly use cublasAlloc and cublasSetVector when using CUBLAS? Thx

mfatica · February 29, 2008, 3:00pm

No, you can mix cublasAlloc and cublasSetVector/cublasGetVector with regular CUDA malloc and memcpy calls (both driver and high-level API).

The cublas calls are there for convenience (for example, if you are calling cublas from Fortran and don’t want to mix C and Fortran).
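To illustrate the mixing described above, here is a minimal sketch in which buffers allocated with plain `cudaMalloc` are first written by a custom kernel and then read by a cuBLAS routine. It assumes the handle-based cuBLAS API (`cublas_v2.h`) rather than the legacy API current when this thread was written; the names `multiply_kernel` and `weighted_dot` are hypothetical.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Custom kernel: out[i] = a[i] * b[i]
__global__ void multiply_kernel(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];
}

// Hypothetical helper: result = sum_i a[i] * b[i] * c[i].
// The element-wise product comes from the custom kernel, the reduction
// from cublasSdot, and both operate on the same cudaMalloc'd memory.
bool weighted_dot(const std::vector<float>& a, const std::vector<float>& b,
                  const std::vector<float>& c, float& result)
{
    result = 0.0f;
    if (a.size() != b.size() || a.size() != c.size()) return false;
    const int n = static_cast<int>(a.size());
    if (n == 0) return true;

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr, *d_tmp = nullptr;
    cublasHandle_t handle = nullptr;

    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_c, bytes) == cudaSuccess &&
              cudaMalloc(&d_tmp, bytes) == cudaSuccess &&
              cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS &&
              cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_c, c.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        // Step 1: custom kernel writes tmp = a .* b
        multiply_kernel<<<blocks, threads>>>(d_a, d_b, d_tmp, n);
        ok = cudaGetLastError() == cudaSuccess;
    }
    if (ok) {
        // Step 2: cuBLAS reads the kernel's output; both run on the default
        // stream here, so the dot product starts after the kernel finishes.
        ok = cublasSdot(handle, n, d_tmp, 1, d_c, 1, &result) == CUBLAS_STATUS_SUCCESS;
    }

    if (handle) cublasDestroy(handle);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_tmp);
    return ok;
}
```

The sketch uses the runtime API; the same pattern should apply to driver-API allocations as stated in the reply above.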

jeronimoh · March 1, 2008, 10:16am

Thank you very much, your reply raised my confidence in my piece of code :)

uiuc99 · March 22, 2008, 10:41am

This is what I am looking for. It’s good to know that you can mix CUDA with CUBLAS, but how would CUBLAS use the memory (shared, texture, etc)? Does CUBLAS know how many multi-processors you have, and optimize each function according to your system parameters? Without knowing how CUBLAS uses the memory, using CUDA at the same time could cause conflict.

MisterAnderson42 · March 22, 2008, 1:07pm

Every kernel call independently uses all resources on the GPU. There can be no conflicts between two separate kernel calls. But if you are truly curious about CUBLAS’s block and grid parameters, just read the source :)

uiuc99 · March 22, 2008, 4:55pm

Thanks! Yes, I’d like to. Where is the CUBLAS source? I can’t find it anywhere.

mfatica · March 22, 2008, 7:56pm

The links are in this post:

The Official NVIDIA Forums | NVIDIA

uiuc99 · March 22, 2008, 8:48pm

Thanks a lot for the secret passage leading to the source code ;)

Yes, I found where they decide the number of blocks and threads, and whether to use texture memory. It seems to be decided separately for each function call. Since I have many calls back to back, is there an optimizer that blends the functions and finds a globally optimal assignment? Maybe I have to combine these .h and .cu files myself.

antonae · April 20, 2009, 3:49pm

I was wondering if you found an efficient way to compute element wise vector multiplication and division.

Also, would implementing a custom kernel penalize performance when mixed with CUBLAS routines? (I don’t know how to implement a custom kernel yet, about to start reading…)

thanks,

ant

antonae · April 22, 2009, 3:31pm

Answering myself. I coded this kernel:

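[codebox]// BLOCK_SIZE must equal the number of threads per block used at launch
__global__ void m2(float *A, float *B, int maxN){
    int i = blockIdx.x*BLOCK_SIZE + threadIdx.x;
    if (i < maxN)
        B[i] = A[i]*A[i];   // element-wise square of A
    __syncthreads();
}[/codebox]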

where maxN would be the size of vector A (and also B’s)

mfatica · April 22, 2009, 3:35pm

There is no need for the syncthreads.
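The barrier is unnecessary because each thread reads and writes only its own element and no shared memory is involved. Here is a minimal sketch of the same kind of kernel, extended to two input vectors and to division (both asked about earlier in the thread). The names `Mul`, `Div`, `elementwise_kernel` and `elementwise` are hypothetical, and `blockDim.x` replaces the `BLOCK_SIZE` macro so that the kernel does not depend on a compile-time constant.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Small functors selecting the per-element operation.
struct Mul { __device__ float operator()(float a, float b) const { return a * b; } };
struct Div { __device__ float operator()(float a, float b) const { return a / b; } };

// One thread per element; no __syncthreads() because threads never
// exchange data with each other.
template <typename Op>
__global__ void elementwise_kernel(const float* a, const float* b,
                                   float* out, int n, Op op)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = op(a[i], b[i]);
}

// Hypothetical host wrapper: copies inputs to the GPU, launches the
// kernel, and copies the result back. Returns true on success.
template <typename Op>
bool elementwise(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& out, Op op)
{
    if (a.size() != b.size()) return false;
    const int n = static_cast<int>(a.size());
    out.assign(n, 0.0f);
    if (n == 0) return true;

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_out, bytes) == cudaSuccess &&
              cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;  // round up
        elementwise_kernel<<<blocks, threads>>>(d_a, d_b, d_out, n, op);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return ok;
}
```

A call such as `elementwise(a, b, out, Mul{})` multiplies, and `elementwise(a, b, out, Div{})` divides. In an iterative solver the vectors would normally stay on the GPU between steps, so the host copies here exist only to keep the sketch self-contained.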

mahnaz · April 22, 2009, 5:26pm

Is the source for CUBLAS available somewhere else? Why is it removed?

mfatica · April 22, 2009, 6:25pm

It is now available for registered developers.

Stanimir · May 15, 2009, 11:33am

OK, I’m a registered developer, how/where can I get it?

shaklee3 · January 4, 2014, 4:24am

Sorry to bump such an old topic, but did anyone ever figure out whether there is a CUBLAS function for element-wise vector-vector multiplication? There is a dot product function, so I’d assume this exists.
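One candidate in more recent cuBLAS releases is the extension routine `cublas<t>dgmm`, which multiplies a matrix by a diagonal matrix given as a vector. Treating one input as an n×1 matrix turns this into an element-wise vector product. A minimal sketch under that assumption follows; the helper name `elementwise_mul_dgmm` is hypothetical, and it uses the handle-based API from `cublas_v2.h`.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: y[i] = a[i] * x[i] via C = diag(a) * X,
// where X is the vector x viewed as an n-by-1 column-major matrix.
// Returns true on success.
bool elementwise_mul_dgmm(const std::vector<float>& a,
                          const std::vector<float>& x,
                          std::vector<float>& y)
{
    if (a.size() != x.size()) return false;
    const int n = static_cast<int>(a.size());
    y.assign(n, 0.0f);
    if (n == 0) return true;

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_x = nullptr, *d_y = nullptr;
    cublasHandle_t handle = nullptr;

    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_x, bytes) == cudaSuccess &&
              cudaMalloc(&d_y, bytes) == cudaSuccess &&
              cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS &&
              cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // CUBLAS_SIDE_LEFT: C = diag(a) * X with m = n rows and 1 column,
        // so the leading dimensions of X and C are both n.
        ok = cublasSdgmm(handle, CUBLAS_SIDE_LEFT, n, 1,
                         d_x, n, d_a, 1, d_y, n) == CUBLAS_STATUS_SUCCESS &&
             cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    if (handle) cublasDestroy(handle);
    cudaFree(d_a);
    cudaFree(d_x);
    cudaFree(d_y);
    return ok;
}
```

Unlike the level 2 routines, `dgmm` is a cuBLAS extension rather than part of the reference BLAS, so code relying on it will not port directly to other BLAS implementations.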

sreeram · June 22, 2017, 7:24am

Hi all,

I would like to perform element-wise multiplication between two vectors using CUBLAS. Could you please share the code for the same. The link mentioned here does not contain the code.

Thanks.

bretin.remy · May 1, 2023, 6:42pm

Well, like the two previous (old) comments asked: is there now an easier and better way to do element-wise vector multiplication?

My function is not fast enough, from my point of view :/

attributes(global) subroutine SUB_vvm(v1,v2,n)
implicit none
integer,device :: n
real(fp),device :: v1(0:n),v2(0:n)
integer :: i
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
IF (i <= n) v1(i) = v1(i) * v2(i)
end subroutine SUB_vvm

Thank you in advance for your support,
Remy
