Subtracting a vector from every column of a matrix

Arno · April 27, 2010, 6:21am

Hi everyone,

What would be a fast way to subtract a vector V from every column of a matrix M using (preferably) CUBLAS?

I first tried constructing a matrix A with the same size as M and with each column equal to V, using cublasSetMatrix(). However, wouldn’t it be much faster if I could load V from host to device first, and then construct A? I can’t find a way to do that using CUBLAS. Any suggestions?

Many thanks,
Arno

avidday · April 27, 2010, 6:54am

You would be better off just writing a small compute kernel to do the operation directly in device memory. How large are the matrices and vectors?

Arno · April 27, 2010, 6:57am

The vector contains 2048 elements (floats) and the matrix is generally 2048 x 65536.

avidday · April 27, 2010, 7:06am

At that size, a dedicated kernel definitely makes sense. There is even some hope of being faster than doing it on the host.

Arno · April 27, 2010, 9:30am

Many thanks for your advice.

I’m obviously new to CUDA, so I’m not sure which solution would be fastest:

(1) copy the vector V to device (global) memory; build a matrix A with each column equal to V using a kernel; subtract A from M,

or

(2) copy one element of V to shared memory; make a thread block subtract this element from the corresponding row of matrix M; do this for all elements of V.

avidday · April 27, 2010, 9:54am

I definitely wouldn’t recommend option (1).

Option (2) should be perfect if your matrix is in row-major order, but if you are working with CUBLAS I presume that isn’t the case? (CUBLAS follows the FORTRAN convention, i.e. column-major storage.)
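
For the row-major case, option (2) could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from this thread: M is assumed to be stored row-major (element (row, col) at M[row * cols + col]), one thread block is assigned per row, and the names subtractRowMajor, subtractRowMajorHost and the block size of 256 are hypothetical. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One block per row of a ROW-MAJOR matrix (assumption).
// The block loads its element of V once into shared memory, then the
// threads sweep across the row so neighbouring threads touch neighbouring
// addresses (coalesced access).
__global__ void subtractRowMajor(float *M, const float *V, int cols)
{
    __shared__ float v;
    int row = blockIdx.x;
    if (threadIdx.x == 0)
        v = V[row];
    __syncthreads();

    float *rowPtr = M + (size_t)row * cols;
    for (int col = threadIdx.x; col < cols; col += blockDim.x)
        rowPtr[col] -= v;
}

// Hypothetical host wrapper: copies data to the device, runs the kernel,
// and copies the result back into M.
void subtractRowMajorHost(std::vector<float> &M, const std::vector<float> &V,
                          int rows, int cols)
{
    assert(M.size() == (size_t)rows * cols && V.size() == (size_t)rows);

    float *dM = nullptr, *dV = nullptr;
    cudaMalloc(&dM, M.size() * sizeof(float));
    cudaMalloc(&dV, V.size() * sizeof(float));
    cudaMemcpy(dM, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, V.data(), V.size() * sizeof(float), cudaMemcpyHostToDevice);

    subtractRowMajor<<<rows, 256>>>(dM, dV, cols);

    // cudaMemcpy is blocking and waits for the kernel to finish.
    cudaMemcpy(M.data(), dM, M.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dM);
    cudaFree(dV);
}
```

With 2048 rows this launches 2048 blocks, each walking one row of 65536 floats.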

If you have column major order, then the best approach is probably to have relatively small blocks of threads (say 32 threads per block) read a contiguous segment of V and store to shared memory, then have each thread traverse a row of M and subtract its element from each row entry. That will improve the memory access patterns and should allow coalesced global memory reads and writes. It should also give enough blocks to get reasonable occupancy of the GPU, which is good for hiding latency and keeping the instruction throughput up.
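
A minimal sketch of the column-major approach described above, for illustration only and not code from this thread. It assumes M is stored column-major as in CUBLAS (element (row, col) at M[row + col * rows]) and uses 32 threads per block as suggested. The names subtractFromColumns, subtractFromColumnsHost and kBlock are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 32;  // threads per block, as suggested above

// Each block handles kBlock consecutive rows of a COLUMN-MAJOR matrix
// (assumption). The block stages its contiguous segment of V in shared
// memory, then every thread walks along its own row. For a fixed column,
// the threads of a block touch kBlock consecutive floats, so the global
// reads and writes can coalesce.
__global__ void subtractFromColumns(float *M, const float *V, int rows, int cols)
{
    __shared__ float vSeg[kBlock];
    int row = blockIdx.x * kBlock + threadIdx.x;

    if (row < rows)
        vSeg[threadIdx.x] = V[row];
    __syncthreads();
    if (row >= rows)
        return;

    // Each thread only reads its own slot, so a register would work too.
    float v = vSeg[threadIdx.x];
    for (int col = 0; col < cols; ++col)
        M[row + (size_t)col * rows] -= v;
}

// Hypothetical host wrapper: copies data to the device, runs the kernel,
// and copies the result back into M.
void subtractFromColumnsHost(std::vector<float> &M, const std::vector<float> &V,
                             int rows, int cols)
{
    assert(M.size() == (size_t)rows * cols && V.size() == (size_t)rows);

    float *dM = nullptr, *dV = nullptr;
    cudaMalloc(&dM, M.size() * sizeof(float));
    cudaMalloc(&dV, V.size() * sizeof(float));
    cudaMemcpy(dM, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, V.data(), V.size() * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (rows + kBlock - 1) / kBlock;  // round up
    subtractFromColumns<<<blocks, kBlock>>>(dM, dV, rows, cols);

    // cudaMemcpy is blocking and waits for the kernel to finish.
    cudaMemcpy(M.data(), dM, M.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dM);
    cudaFree(dV);
}
```

For the sizes mentioned in this thread (2048 rows, 65536 columns), this launches 2048 / 32 = 64 blocks. If M already lives on the device for CUBLAS calls, only the kernel launch is needed and the host copies in the wrapper can be dropped.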

Arno · April 27, 2010, 1:04pm

It’s indeed in column major order. Thanks for the very helpful and clear answers! Now I can start coding…

Ayshachaudhry · July 27, 2017, 5:32am

@Arno, can you share the code for subtracting a vector from a matrix in CUDA?