Hi there!

I want to implement the following algorithm: given a matrix A of size MxN and a vector v of size 1xN, multiply each row of A by v element-wise. The result should be a new matrix B with the same size as A.
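
To pin down exactly what I mean, here is a plain CPU reference in host-side C++, i.e. B[r][c] = A[r][c] * v[c]. It is only a sketch for checking results: the function name `rowwise_mul_reference` is made up for illustration, and it assumes the matrix is stored row-major in a flat array, as in my kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (name chosen for illustration only).
// Assumes row-major storage: element (row, col) lives at mat[row * N + col].
std::vector<float> rowwise_mul_reference(const std::vector<float>& mat,
                                         const std::vector<float>& vec,
                                         std::size_t M, std::size_t N)
{
    assert(mat.size() == M * N);  // matrix must hold M rows of N elements
    assert(vec.size() == N);      // one weight per column

    std::vector<float> out(M * N);
    for (std::size_t row = 0; row < M; ++row) {
        for (std::size_t col = 0; col < N; ++col) {
            // Every row is scaled by the same weight vector.
            out[row * N + col] = mat[row * N + col] * vec[col];
        }
    }
    return out;
}
```

Comparing the GPU output against this function element by element should be enough to validate any faster kernel.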

I searched online but couldn’t find an existing worked, optimized example. This operation is common in DSP, where the vector acts as a set of weights that limits the spectral content of an array of 1D FFTs.

I implemented a naive kernel in which each thread performs the element-wise multiplication for one row:

```
// mat and out are M x N, stored row-major; vec has N elements.
__global__
void vec_mat_rowwise_mul(
    float* mat,
    float* vec,
    float* out,
    int N,
    int M)
{
    // one thread per row of the matrix
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M) {
        // each thread multiplies its whole row by the vector, element by element
        for (int i = 0; i < N; i++) {
            out[row * N + i] = mat[row * N + i] * vec[i];
        }
    }
}
```

I would like to know whether there are better, more optimized ways to perform this operation. I’d also be grateful if you could point me to some reading material!