How to efficiently repeat a vector into a matrix in CUDA?

I want to repeat a vector to form a matrix in CUDA while avoiding excessive memory copies. Both the vector and the matrix are allocated on the GPU.

For example:

I have a vector: a = [1 2 3 4]

and I want to expand it to a matrix: b = [1 2 3 4; 1 2 3 4; …; 1 2 3 4]

What I have tried is assigning each element of b individually, but this involves a lot of GPU-to-GPU memory copies.

I know this is easy in MATLAB (using repmat), but how can it be done efficiently in CUDA? I didn't find any such routine in cuBLAS.

And do you wish to commence from the host or the device?

From the host, I would think that a cudaMemcpy for each matrix row would be an elementary solution.
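A minimal sketch of that per-row approach might look like the following (the sizes, variable names, and verification step are illustrative assumptions, not from the thread):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int rows = 3, cols = 4;
    const float h_v[cols] = {1, 2, 3, 4};

    float *d_v, *d_b;
    cudaMalloc(&d_v, cols * sizeof(float));
    cudaMalloc(&d_b, rows * cols * sizeof(float));
    cudaMemcpy(d_v, h_v, cols * sizeof(float), cudaMemcpyHostToDevice);

    // One device-to-device copy per matrix row, issued from the host.
    for (int r = 0; r < rows; ++r)
        cudaMemcpy(d_b + r * cols, d_v, cols * sizeof(float),
                   cudaMemcpyDeviceToDevice);

    // Copy back and verify that every row equals the vector.
    float h_b[rows * cols];
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            ok = ok && (h_b[r * cols + c] == h_v[c]);
    printf(ok ? "ok\n" : "mismatch\n");

    cudaFree(d_v);
    cudaFree(d_b);
    return 0;
}
```

Note that each cudaMemcpy call has launch overhead, so for many rows a single kernel is usually faster.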

From the device, you have several options: use a number of thread blocks to distribute the vector to the different matrix rows; use a single block to read the vector into shared memory and then distribute it to each matrix row; or even use dynamic parallelism to have a single thread initiate a number of cudaMemcpy calls, similar to the host solution mentioned above.
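The first of those options — thread blocks distributing the vector to matrix rows — could be sketched like this (kernel name, grid shape, and sizes are my own assumptions):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes one element of b; every row reads the same
// vector element v[col], so the vector is broadcast across rows.
__global__ void repeatVector(const float *v, float *b, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)
        b[row * cols + col] = v[col];
}

int main() {
    const int rows = 3, cols = 4;
    const float h_v[cols] = {1, 2, 3, 4};

    float *d_v, *d_b;
    cudaMalloc(&d_v, cols * sizeof(float));
    cudaMalloc(&d_b, rows * cols * sizeof(float));
    cudaMemcpy(d_v, h_v, cols * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    repeatVector<<<grid, block>>>(d_v, d_b, rows, cols);
    cudaDeviceSynchronize();

    float h_b[rows * cols];
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            ok = ok && (h_b[r * cols + c] == h_v[c]);
    printf(ok ? "ok\n" : "mismatch\n");

    cudaFree(d_v);
    cudaFree(d_b);
    return 0;
}
```

Reads of v[col] by consecutive threads are coalesced and served from cache, so an explicit shared-memory stage is optional here.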

If you want to go beyond something that just copies the data to duplicate it, the most efficient way is not to copy at all: change the code that reads the matrix so that it reads the vector directly where a vector is enough.
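In other words, wherever a kernel would have read b[row][col], it can read v[col] instead and never materialize b. As a hedged illustration (the kernel and the row-wise-add use case are my own invention, not from the thread):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds the vector v to every row of matrix m without ever building
// the repeated matrix: v[idx % cols] stands in for b[row][col].
__global__ void addRepeatedVector(const float *m, const float *v,
                                  float *out, int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols)
        out[idx] = m[idx] + v[idx % cols];  // vector broadcast across rows
}

int main() {
    const int rows = 2, cols = 4;
    const float h_v[cols] = {1, 2, 3, 4};
    const float h_m[rows * cols] = {0, 0, 0, 0, 10, 10, 10, 10};

    float *d_v, *d_m, *d_out;
    cudaMalloc(&d_v, cols * sizeof(float));
    cudaMalloc(&d_m, rows * cols * sizeof(float));
    cudaMalloc(&d_out, rows * cols * sizeof(float));
    cudaMemcpy(d_v, h_v, cols * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_m, h_m, sizeof(h_m), cudaMemcpyHostToDevice);

    addRepeatedVector<<<1, 256>>>(d_m, d_v, d_out, rows, cols);
    cudaDeviceSynchronize();

    float h_out[rows * cols];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            ok = ok && (h_out[r * cols + c] == h_m[r * cols + c] + h_v[c]);
    printf(ok ? "ok\n" : "mismatch\n");

    cudaFree(d_v);
    cudaFree(d_m);
    cudaFree(d_out);
    return 0;
}
```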

I can give you a very basic way to turn it into a matrix:
have a look at the ger function (cublas&lt;t&gt;ger) within cuBLAS.

Apply it to 1·vᵀ or v·1ᵀ (depending on the orientation you need), where 1 is a vector of ones.
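Since ger computes the rank-1 update A = α·x·yᵀ + A, zero-initializing A and taking x = ones, y = v leaves A holding m identical copies of v. A minimal sketch with cublasSger (sizes and variable names are assumptions; note cuBLAS stores matrices column-major):

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 3, n = 4;              // m copies of an n-element vector
    const float h_v[n]    = {1, 2, 3, 4};
    const float h_ones[m] = {1, 1, 1};

    float *d_v, *d_ones, *d_B;
    cudaMalloc(&d_v, n * sizeof(float));
    cudaMalloc(&d_ones, m * sizeof(float));
    cudaMalloc(&d_B, m * n * sizeof(float));
    cudaMemcpy(d_v, h_v, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ones, h_ones, m * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_B, 0, m * n * sizeof(float));  // ger accumulates into B

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f;
    // Rank-1 update: B = alpha * ones * v^T + B  (column-major, lda = m)
    cublasSger(handle, m, n, &alpha, d_ones, 1, d_v, 1, d_B, m);

    float h_B[m * n];
    cudaMemcpy(h_B, d_B, sizeof(h_B), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int c = 0; c < n; ++c)          // column-major: column c holds v[c]
        for (int r = 0; r < m; ++r)
            ok = ok && (h_B[c * m + r] == h_v[c]);
    printf(ok ? "ok\n" : "mismatch\n");

    cublasDestroy(handle);
    cudaFree(d_v);
    cudaFree(d_ones);
    cudaFree(d_B);
    return 0;
}
```

This spends a multiply-add per element where a plain copy would do, so it is more a convenience than a performance win.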

That’s brilliant, man!