And do you want to start the copy from the host or from the device?
From the host, I would think that one cudaMemcpy per matrix row would be an elementary solution.
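A minimal sketch of that host-side approach, assuming a dense row-major float matrix already allocated on the device. The helper name fill_rows_from_host and its parameters are made up for illustration; they are not from the original post.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: copies the same host vector (cols floats) into every
// row of a row-major device matrix of size rows x cols.
// Assumes d_matrix was allocated with cudaMalloc(rows * cols * sizeof(float)).
cudaError_t fill_rows_from_host(float* d_matrix, const float* h_vec,
                                int rows, int cols)
{
    for (int r = 0; r < rows; ++r) {
        // One host-to-device copy per matrix row.
        cudaError_t err = cudaMemcpy(d_matrix + static_cast<size_t>(r) * cols,
                                     h_vec,
                                     static_cast<size_t>(cols) * sizeof(float),
                                     cudaMemcpyHostToDevice);
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

This is simple, but it issues one transfer per row, so for matrices with many rows the per-call overhead is likely to dominate.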
From the device, you have a few options: use a number of thread blocks to distribute the vector to the different matrix rows; use a single block to read the vector into shared memory and then write it to each matrix row; or even use dynamic parallelism to have a single thread initiate a number of cudaMemcpyAsync calls, similar to the host solution mentioned above.
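The first device-side option could look roughly like the sketch below, assuming the vector is already in device memory and the matrix is dense, row-major float. The names broadcast_rows and launch_broadcast_rows, the block size of 256, and the cap of 1024 blocks in y are arbitrary choices for illustration. Each thread keeps its vector element in a register rather than in shared memory, which is enough here because every thread only ever needs one element.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Each thread owns one column: it reads vec[col] once and writes it into
// every row assigned to its block (rows are strided over gridDim.y).
__global__ void broadcast_rows(float* matrix, const float* vec,
                               int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;
    float v = vec[col];  // read once, reuse for all rows
    for (int row = blockIdx.y; row < rows; row += gridDim.y)
        matrix[static_cast<size_t>(row) * cols + col] = v;  // coalesced write
}

// Hypothetical launcher; the launch configuration is an assumption.
cudaError_t launch_broadcast_rows(float* d_matrix, const float* d_vec,
                                  int rows, int cols)
{
    if (rows <= 0 || cols <= 0) return cudaSuccess;
    const int threads = 256;
    dim3 grid((cols + threads - 1) / threads, rows < 1024 ? rows : 1024);
    broadcast_rows<<<grid, threads>>>(d_matrix, d_vec, rows, cols);
    return cudaGetLastError();
}
```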
If you want to go beyond simply duplicating the data, the most efficient approach is not to copy at all: change the code that currently reads a matrix so that it reads the vector directly, since a vector is enough.
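To illustrate the no-copy idea, here is a sketch under the assumption that the consuming code adds the duplicated matrix element-wise to another matrix. That operation is invented for the example, as are the names add_row_vector and launch_add_row_vector. Instead of reading b[row * cols + col] from a matrix whose rows are all identical, the kernel indexes the vector by column only.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// out[row][col] = a[row][col] + vec[col]
// The "matrix of repeated rows" is never materialized: the row index is
// simply ignored when reading the vector.
__global__ void add_row_vector(float* out, const float* a, const float* vec,
                               int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;
    float v = vec[col];
    for (int row = blockIdx.y; row < rows; row += gridDim.y) {
        size_t idx = static_cast<size_t>(row) * cols + col;
        out[idx] = a[idx] + v;  // instead of a[idx] + b[idx]
    }
}

// Hypothetical launcher; the launch configuration is an assumption.
cudaError_t launch_add_row_vector(float* d_out, const float* d_a,
                                  const float* d_vec, int rows, int cols)
{
    if (rows <= 0 || cols <= 0) return cudaSuccess;
    const int threads = 256;
    dim3 grid((cols + threads - 1) / threads, rows < 1024 ? rows : 1024);
    add_row_vector<<<grid, threads>>>(d_out, d_a, d_vec, rows, cols);
    return cudaGetLastError();
}
```

Besides skipping the copy, this saves rows * cols floats of device memory and the bandwidth needed to read them back.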