cuBLAS matrix dot product?

I need a cuBLAS function that performs the dot product (component by component) of two matrices.
The only function I found is cublasSdot(), but it's for vectors, and cublasSgemm() performs a matrix product, not a component-wise product.

Now I think it's possible to use cublasSdot() by passing as input a vector of size n*m that represents my matrix. Is there a built-in cuBLAS function that enables a matrix dot product?

No, there is no such matrix dot product in CUBLAS.

Can you be more explicit about your matrix dot product? Do you want to get N dot product results (each dot product made from the corresponding columns of the two matrices), or do you want only one result (treating each matrix as an m*n vector)? In the latter case, you can simply use cublasSdot (assuming that your matrix is in column-major format with lda = m).
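A minimal sketch of that single-result case (assuming the m*n matrix is stored contiguously in column-major order with lda = m, so it can be treated as one vector of length m*n; handle creation and error checking are omitted, and the wrapper name is illustrative):

```cuda
#include <cublas_v2.h>

// d_A and d_B are device pointers to contiguous m*n column-major matrices.
// Returns sum_{i,j} A(i,j) * B(i,j).
float matrixDot(cublasHandle_t handle, const float *d_A, const float *d_B,
                int m, int n)
{
    float result = 0.0f;
    // Treat each matrix as a single vector of length m*n with unit stride.
    cublasSdot(handle, m * n, d_A, 1, d_B, 1, &result);
    return result;
}
```

Note this only works when the matrix is a full contiguous allocation; with a padded leading dimension (lda > m) the padding elements would be included in the sum.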

I copy below the CUDA code for this matrix product. It's just a product of two matrices, component by component: if I have two matrices A and B of size n*m, the output matrix C will be of size n*m, where each component C(i,j) = A(i,j)*B(i,j).

__global__ void productMatrixCompGPU(int *in1, int *in2, int *out, int row, int col)
{
    int indexRow = threadIdx.x + blockIdx.x * blockDim.x;
    int indexCol = threadIdx.y + blockIdx.y * blockDim.y;

    if (indexRow < row && indexCol < col)
    {
        int index = indexRow + indexCol * row; // column-major linear index
        out[index] = in1[index] * in2[index];
    }
}

I need to run a benchmark for this work using both a CUDA kernel and a CUBLAS function.

What you are looking for, also called the Hadamard product (SHAD*), is not part of CUBLAS.
You will need to write a simple kernel. It may be faster to treat the matrices as 1D arrays (if you are transforming the full matrices and not a subset).
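A sketch of that 1D approach (kernel and variable names are illustrative; it assumes both matrices are stored contiguously, so the row/column structure can be ignored and a 1D launch grid used):

```cuda
// Element-wise (Hadamard) product of two contiguous n-element arrays.
__global__ void hadamard1D(const int *in1, const int *in2, int *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = in1[i] * in2[i];
}

// Launch with enough threads to cover all n = row * col elements:
// int n = row * col;
// int block = 256;
// hadamard1D<<<(n + block - 1) / block, block>>>(d_in1, d_in2, d_out, n);
```

Besides simplifying the indexing, the 1D version avoids the 2D grid-size bookkeeping and keeps consecutive threads reading consecutive addresses, which is what a memory-bound kernel like this needs.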

There is nothing like this in CUBLAS; cublasSdot is NOT what you want. This operation is really memory-bound: only one multiplication for 2 loads and 1 store in global memory.

So your current implementation should already be good.

Maybe you could compute multiple elements per thread (especially if your matrices are big, because they might not fit in your launch grid).
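One common way to process multiple elements per thread is a grid-stride loop, sketched below (names are illustrative): each thread strides through the array by the total number of launched threads, so any array size fits in a fixed-size grid.

```cuda
// Each thread handles elements i, i + stride, i + 2*stride, ...
// where stride = total number of threads in the grid.
__global__ void hadamardStrided(const int *in1, const int *in2, int *out, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += stride)
        out[i] = in1[i] * in2[i];
}

// The grid size can then be chosen independently of n, e.g.:
// hadamardStrided<<<1024, 256>>>(d_in1, d_in2, d_out, n);
```

This decouples the launch configuration from the problem size and keeps accesses coalesced, since consecutive threads still touch consecutive elements on each pass.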

I want each thread to work on only one matrix component, so I want to know the maximum size of my matrix…

Finally, how many threads can I launch in a CUDA kernel?