I am trying to compute a sparse matrix times sparse vector product (resulting in a sparse vector). My specific question is how to maximize GPU utilization. Part of the operation consists of (i) multiplying a vector i (with n elements) by a scalar alpha, and (ii) multiplying a vector j (with m elements) by a scalar beta. How would one set up the data structures and the grid/block/thread layout so the GPU is kept busy? I imagine that would involve doing a portion of (i) and (ii) simultaneously, since otherwise threads sit idle whenever n is not a multiple of the number of processors.
Any suggestions?