My application needs to compute dot products of a single vector (~700 floats) with 100,000 other vectors of the same length, and do it very fast.
Can this be done efficiently with CUDA on, say, a GT200 card, or will it be bound by memory accesses? I've read that a computation needs a high compute-to-memory-access ratio to perform well on CUDA.
What other pitfalls should I look out for in a computation like this?
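For reference, here is a naive sketch of the kind of kernel I have in mind (all names are illustrative, not working code from my project). It assumes the query vector is placed in constant memory and the 100,000 vectors are stored column-major so that consecutive threads read consecutive addresses:

```cuda
#include <cuda_runtime.h>

#define VEC_LEN 700  // length of each vector

// Query vector in constant memory: 700 floats = 2.8 KB, well under the
// 64 KB constant-memory limit; all threads read the same element at the
// same time, which constant cache broadcasts efficiently.
__constant__ float d_q[VEC_LEN];

// One thread per dot product. Vectors stored column-major:
// element k of vector i lives at vecs[k * numVecs + i], so consecutive
// threads touch consecutive addresses and the loads coalesce.
__global__ void dotAll(const float *vecs, float *out, int numVecs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numVecs) return;

    float sum = 0.0f;
    for (int k = 0; k < VEC_LEN; ++k)
        sum += d_q[k] * vecs[k * numVecs + i];
    out[i] = sum;
}

// Host side (error checking omitted):
//   cudaMemcpyToSymbol(d_q, h_q, VEC_LEN * sizeof(float));
//   dotAll<<<(numVecs + 255) / 256, 256>>>(d_vecs, d_out, numVecs);
```

With this layout each vector element is read exactly once (one multiply-add per load), so I'd expect the kernel to be limited by memory bandwidth rather than arithmetic. Since the whole computation is really a matrix-vector product, cuBLAS's `cublasSgemv` might also be an option instead of a hand-written kernel.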
Any advice will be appreciated!