What do you think would be the right direction for implementing a CG solver with CUDA?
Would it be enough to upload A and C (the preconditioner) to the GPU once at the start and then, on every iteration, upload the vectors, do the multiplication, and download the result, letting the CPU do the rest of the work? This would treat the GPU as nothing more than a matrix-vector multiplication coprocessor.
Or maybe I should somehow move the whole algorithm to the GPU?
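To make the first option concrete, here is a rough sketch of the split I have in mind, written in plain Python rather than CUDA. It is only an illustration under several assumptions: A is symmetric positive definite, C approximates the inverse of A so that applying the preconditioner is a plain product z = C r, and both matrices are stored as dense lists of rows for brevity. The function `offload_matvec` is a hypothetical stand-in for the GPU step; here it runs on the CPU and merely counts the values that would cross the bus.

```python
def offload_matvec(M, v, stats):
    """Hypothetical stand-in for the GPU step: upload v, compute M*v, download the result.

    Runs on the CPU here and only counts the values that would be transferred.
    """
    stats["uploaded"] += len(v)
    result = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]
    stats["downloaded"] += len(result)
    return result


def dot(x, y):
    """Dot product; in option 1 this stays on the CPU."""
    return sum(a * b for a, b in zip(x, y))


def pcg(A, C, b, tol=1e-10, max_iter=1000):
    """Preconditioned CG sketch. Assumes A is SPD and C approximates inverse(A)."""
    stats = {"uploaded": 0, "downloaded": 0, "iterations": 0}
    x = [0.0] * len(b)
    r = list(b)                              # r = b - A*x with x = 0
    z = offload_matvec(C, r, stats)          # "GPU": z = C*r
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:          # CPU: convergence check
            break
        Ap = offload_matvec(A, p, stats)     # "GPU": A*p
        alpha = rz / dot(p, Ap)              # CPU: scalar work
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # CPU: vector update
        r = [ri - alpha * api for ri, api in zip(r, Ap)]   # CPU: vector update
        z = offload_matvec(C, r, stats)      # "GPU": z = C*r
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]       # CPU: vector update
        rz = rz_new
        stats["iterations"] += 1
    return x, stats
```

In this sketch every iteration moves four vectors of length n over the bus (p up, A*p down, r up, C*r down), while the dot products and vector updates stay on the CPU. Moving the whole algorithm to the GPU would presumably mean keeping x, r, z and p in device memory and transferring only a few scalars per iteration.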
The first solution sounds easy to implement, but I don’t know whether the multiplication alone would amortize the CPU-GPU-CPU transfers. The second solution sounds harder, and I don’t have a clear idea of how to do it right now.
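The amortization question can at least be framed as a back-of-envelope estimate. The sketch below is a simplistic model, not a measurement: it assumes sparse A and C stored with one value and one column index per nonzero, assumes the matrix-vector products are limited by GPU memory bandwidth, and ignores transfer latency and kernel launch overhead. The function name and all parameters are made up for illustration, and the bandwidth figures have to be supplied by whoever runs it.

```python
def option1_time_per_iteration(n, nnz_a, nnz_c, bus_bytes_per_s, gpu_bytes_per_s,
                               bytes_per_value=4, bytes_per_index=4):
    """Rough per-iteration cost of the 'GPU as matvec coprocessor' option.

    Returns (transfer_seconds, matvec_seconds) under the assumptions above.
    """
    # Four vector transfers per iteration: p up, A*p down, r up, C*r down.
    transfer_bytes = 4 * n * bytes_per_value
    # Each nonzero of A and C is read once, together with its column index.
    matvec_bytes = (nnz_a + nnz_c) * (bytes_per_value + bytes_per_index)
    return transfer_bytes / bus_bytes_per_s, matvec_bytes / gpu_bytes_per_s


# Usage with placeholder numbers (not measured values for any real card):
transfer_s, matvec_s = option1_time_per_iteration(
    n=1_000_000, nnz_a=7_000_000, nnz_c=1_000_000,
    bus_bytes_per_s=2e9, gpu_bytes_per_s=50e9)
print(f"transfers: {transfer_s:.2e} s, matvecs: {matvec_s:.2e} s")
```

If the transfer time comes out comparable to or larger than the matvec time for realistic numbers, the first option would spend much of each iteration waiting on the bus, which would be an argument for the second one.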
Please share your thoughts: what would be the right strategy? Thanks.
(I can’t estimate things myself right now since I only have access to a G80 at university, so I can only do a theoretical investigation.)