# Vector[1xN] * Matrix[NxM]: how would you set it up?

Hi All,

I need to multiply a huge vector V (say, 150,000,000 floats) by a matrix M, but M is not a real, stored matrix: each entry is given by a hardcoded formula F(row, column), so the value must be recomputed for every row and column. With a vector this large, the formula is essentially the only feasible representation (a 150M x 150M matrix of floats is far beyond any memory).

Thanks in advance for any thoughts.

That's big enough that I would write two kernels and compute one element of the output array at a time. You'll need three arrays: V, tempV, and outV (which will hold M * V). Kernel #1 computes V[i] * M[r, i] for a fixed parameter r and stores the products in tempV. Kernel #2 then does a parallel reduction on tempV and writes the sum to outV[r].

Then you just have to call Kernel #1 followed by Kernel #2 for each value of r.
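A minimal sketch of that two-kernel scheme (the body of F(), the launch sizes, and the host-side reduction loop are my assumptions, not anything specified in the original post):

```cuda
#include <cuda_runtime.h>

// Stand-in for the poster's hardcoded formula; replace with the real F.
__device__ float F(int row, int col)
{
    return 1.0f / (1.0f + fabsf((float)(row - col)));
}

// Kernel #1: tempV[i] = V[i] * M[r][i] for a fixed row r.
__global__ void rowProduct(const float *V, float *tempV, int r, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        tempV[i] = V[i] * F(r, i);
}

// Kernel #2 (one step): fold the upper half of tempV onto the lower half.
// Launched repeatedly from the host until one element remains.
__global__ void reduceStep(float *tempV, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int half = (len + 1) / 2;          // handles odd lengths
    if (i < len - half)
        tempV[i] += tempV[i + half];
}
```

The per-row driver on the host might then look like this (dV and dTempV are device buffers of length N, outV is a host buffer):

```cuda
int threads = 256;
int blocks  = (N + threads - 1) / threads;
for (int r = 0; r < N; ++r) {
    rowProduct<<<blocks, threads>>>(dV, dTempV, r, N);
    for (int len = N; len > 1; len = (len + 1) / 2)
        reduceStep<<<(len / 2 + threads - 1) / threads, threads>>>(dTempV, len);
    cudaMemcpy(&outV[r], dTempV, sizeof(float), cudaMemcpyDeviceToHost);
}
```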

A variation to speed this up would be to have all the threads in each block do the parallel reduction in shared memory at the end of Kernel #1. Then Kernel #2 only has to do a parallel reduction over 150,000,000/[threads per block] elements.
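That variant might look like this (a sketch; the forward-declared F() and the power-of-two blockDim.x are assumptions):

```cuda
__device__ float F(int row, int col);  // the poster's hardcoded formula

// Kernel #1 variant: each block multiplies its slice of V and reduces it
// in shared memory, emitting one partial sum per block. blockDim.x is
// assumed to be a power of two; launch with blockDim.x * sizeof(float)
// bytes of dynamic shared memory.
__global__ void rowProductReduce(const float *V, float *partial, int r, int N)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < N) ? V[i] * F(r, i) : 0.0f;
    __syncthreads();

    // Tree reduction over the block's shared-memory slice.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];  // Kernel #2 then reduces 'partial'
}
```

With 256 threads per block, Kernel #2 only has to reduce roughly 586,000 partial sums instead of 150,000,000 products.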

BTW, I hope you're getting one of the GT200-series Tesla cards, or plan to divide this job over multiple cards. With 150,000,000 floats in each of the input and output arrays, you already need 1.2 GB of CUDA memory. Actually, this is so much floating-point work that you should seriously look at the quad Tesla box, the FASTRA, or stuffing a few GTX 280s into a computer. :)

Oh wait, I'm being stupid about the memory thing. outV doesn't actually need to hold all the elements at once. You can fill however many fit within your memory budget, copy them out to the host, and then start filling outV from the beginning again.
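Host-side, that chunking could be as simple as this sketch (computeRow is a hypothetical helper that runs Kernels #1 and #2 for one row, dOutChunk is a device buffer of chunkRows floats, and chunkRows is whatever fits your memory budget):

```cuda
// Fill a fixed-size device chunk of outV, copy it out, then reuse it.
size_t chunkRows = 1 << 20;  // hypothetical chunk size; tune to fit memory
for (size_t base = 0; base < (size_t)N; base += chunkRows) {
    size_t rows = (base + chunkRows <= (size_t)N) ? chunkRows
                                                  : (size_t)N - base;
    for (size_t k = 0; k < rows; ++k)
        computeRow(base + k, dOutChunk + k);  // Kernels #1 + #2 for one row
    cudaMemcpy(hostOut + base, dOutChunk, rows * sizeof(float),
               cudaMemcpyDeviceToHost);
}
```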

Similarly, by using the shared-memory reduction trick to shrink tempV, you can probably make this entire task fit on one GTX 280.

Yeah, this fits onto a single GTX 280. OK, thanks for the advice!