What would be a fast way to subtract a vector V from every column of a matrix M using (preferably) CUBLAS?
I first tried constructing a matrix A with the same size as M and with each column equal to V, using cublasSetMatrix(). However, wouldn’t it be much faster if I could load V from host to device first, and then construct A? I can’t find a way to do that using CUBLAS. Any suggestions?
I’m obviously new to CUDA, so I’m not sure which solution would be fastest:
(1) copy the vector V to device (global) memory; build a matrix A with each column equal to V using a kernel; subtract A from M,
or
(2) copy one element of V to shared memory; have a thread block subtract this element from the corresponding row of matrix M; do this for all elements of V.
Option (2) should be perfect if your matrix is in row major order, but if you are working with CUBLAS I presume that isn’t the case? (CUBLAS follows the FORTRAN convention, i.e. column major storage.)
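For the row major case, option (2) could look roughly like the sketch below. This is only an illustration under my own assumptions: the names subtractVecRowMajor and subtractRowMajorOnDevice, the 128 threads per block, and the float element type are all made up for the example, and M is assumed to be stored densely (leading dimension equal to the number of columns).

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Row major M (rows x cols, leading dimension ld). One block per row:
// thread 0 stages V[row] in shared memory, then all threads of the block
// stride across the row, so neighbouring threads touch neighbouring addresses.
__global__ void subtractVecRowMajor(float *M, const float *V, int cols, int ld)
{
    __shared__ float v;
    int row = blockIdx.x;
    if (threadIdx.x == 0) v = V[row];
    __syncthreads();
    for (int j = threadIdx.x; j < cols; j += blockDim.x)
        M[(size_t)row * ld + j] -= v;
}

// Hypothetical host helper: copies M and V to the device, runs the kernel,
// and copies M back. Returns the first CUDA error it sees.
cudaError_t subtractRowMajorOnDevice(std::vector<float> &M,
                                     const std::vector<float> &V,
                                     int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    assert(M.size() == (size_t)rows * cols && V.size() == (size_t)rows);
    float *dM = nullptr, *dV = nullptr;
    size_t bytesM = M.size() * sizeof(float), bytesV = V.size() * sizeof(float);
    cudaError_t err = cudaMalloc(&dM, bytesM);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&dV, bytesV);
    if (err != cudaSuccess) { cudaFree(dM); return err; }
    cudaMemcpy(dM, M.data(), bytesM, cudaMemcpyHostToDevice);
    cudaMemcpy(dV, V.data(), bytesV, cudaMemcpyHostToDevice);
    subtractVecRowMajor<<<rows, 128>>>(dM, dV, cols, cols);  // one block per row
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(M.data(), dM, bytesM, cudaMemcpyDeviceToHost);
    cudaFree(dM);
    cudaFree(dV);
    return err;
}
```

Calling subtractRowMajorOnDevice(M, V, rows, cols) on host vectors should leave M[i][j] - V[i] in M. In real code you would keep M on the device between calls rather than copying it back and forth each time.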
If you have column major order, then the best approach is probably to have relatively small blocks of threads (say 32 threads per block) read a contiguous segment of V and store it to shared memory, then have each thread traverse one row of M and subtract its element of V from every entry in that row. Neighbouring threads then touch neighbouring addresses within a column, which should allow coalesced global memory reads and writes. It should also give enough blocks to get reasonable occupancy of the GPU, which is good for hiding latency and keeping the instruction throughput up.
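A minimal sketch of the column major variant, again under my own assumptions: the names subtractVecFromColumns and subtractOnDevice, the BLOCK constant, and the float element type are invented for the example, and the leading dimension is taken to be the number of rows.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 32  // threads per block, as suggested above

// Column major M (rows x cols, leading dimension ld >= rows).
// Each block stages a contiguous 32-element segment of V in shared memory;
// each thread then walks along one row of M. At every column step the threads
// of a block access consecutive addresses, so the accesses can coalesce.
__global__ void subtractVecFromColumns(float *M, const float *V,
                                       int rows, int cols, int ld)
{
    __shared__ float vs[BLOCK];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) vs[threadIdx.x] = V[row];
    __syncthreads();
    if (row >= rows) return;
    // Each thread only needs its own slot, so a register would work as well.
    float v = vs[threadIdx.x];
    for (int j = 0; j < cols; ++j)
        M[row + (size_t)j * ld] -= v;
}

// Hypothetical host helper: copies M and V to the device, runs the kernel,
// and copies M back. Returns the first CUDA error it sees.
cudaError_t subtractOnDevice(std::vector<float> &M, const std::vector<float> &V,
                             int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    assert(M.size() == (size_t)rows * cols && V.size() == (size_t)rows);
    float *dM = nullptr, *dV = nullptr;
    size_t bytesM = M.size() * sizeof(float), bytesV = V.size() * sizeof(float);
    cudaError_t err = cudaMalloc(&dM, bytesM);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&dV, bytesV);
    if (err != cudaSuccess) { cudaFree(dM); return err; }
    cudaMemcpy(dM, M.data(), bytesM, cudaMemcpyHostToDevice);
    cudaMemcpy(dV, V.data(), bytesV, cudaMemcpyHostToDevice);
    int blocks = (rows + BLOCK - 1) / BLOCK;  // round up to cover all rows
    subtractVecFromColumns<<<blocks, BLOCK>>>(dM, dV, rows, cols, rows);
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(M.data(), dM, bytesM, cudaMemcpyDeviceToHost);
    cudaFree(dM);
    cudaFree(dV);
    return err;
}
```

If M already lives on the device (for example after cublasSetMatrix()), you would launch the kernel directly on that pointer and skip the host helper. Whether 32 threads per block is the best choice probably depends on the card, so it is worth timing a few block sizes.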