GMRES, global synchronization

I’m thinking about a GMRES solver with CUDA. Memory is going to be an issue for sure, but to solve

Ax = b

I will need to compute the Krylov sequence:

b, Ab, AAb, AAAb, …

A first thought is that it would be nice to preload A into shared memory (perhaps some number of rows per block) and apply it first to b, then to Ab, and so on, reusing the preloaded matrix each time.

However, this would require a global synchronization to know when Ab is done for all blocks. I can imagine doing this with atomicAdd and polling threads, but I think this is a bad idea, because it would mean all blocks would need to stay resident until the operation completes…
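A rough sketch of the atomicAdd-plus-polling barrier described above (all names hypothetical; shown only to illustrate why it's fragile — it deadlocks if the launch has more blocks than can be resident at once):

```cuda
// Hypothetical global barrier via atomicAdd + spinning.
// Deadlocks unless ALL blocks of the grid are resident simultaneously:
// a block that hasn't been scheduled yet can never increment the counter,
// so the blocks already spinning wait forever.
__device__ volatile int g_arrived = 0;

__device__ void bad_global_barrier(int numBlocks)
{
    __syncthreads();                      // everyone in the block is done
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_arrived, 1);  // announce this block's arrival
        while (g_arrived < numBlocks)     // poll until every block arrives
            ;
    }
    __syncthreads();                      // release the rest of the block
}
```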

Is this a horrible idea?

Also, I would be happy to hear from anyone else thinking about a GMRES solver, and any ideas they may have…

Inter-block synchronization is generally a bad idea. How large are these vectors? If they don’t entirely fit in shared memory, you’ll have to write partial results to global memory anyway, so you might as well make a kernel that does the operation:

A * y => y’

where y and y’ are vectors in global memory. After each call, swap the pointers.

Then you can initialize y with your b vector and call the kernel many times in a row. You can queue up many calls without having to explicitly synchronize, and the driver will run everything in sequence.
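A minimal sketch of that pattern (names made up, and a dense mat-vec shown for brevity — the same pointer-swap idea applies to a sparse kernel). Launches on the same stream execute in order, so no explicit synchronization is needed between calls:

```cuda
// One thread per row of the dense matrix A (n x n, row-major).
__global__ void matvec(const float *A, const float *y, float *yOut, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[row * n + j] * y[j];
        yOut[row] = sum;
    }
}

// Host side: initialize dY with b; after m launches dY holds A^m * b.
// Kernels queued on the same stream run in sequence, so the swap is safe.
void krylov(const float *dA, float *dY, float *dYNext, int n, int m)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int k = 0; k < m; ++k) {
        matvec<<<blocks, threads>>>(dA, dY, dYNext, n);
        float *tmp = dY; dY = dYNext; dYNext = tmp;   // swap pointers
    }
}
```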

Yes, this seems to be the main point. I want to handle large sparse matrices, and since my card has only 8 multiprocessors with 16 KB of shared memory each, I would run out of room pretty quickly.

I want to use CSR format for my matrix. A big goal is coalesced memory access when reading the matrix entries and the column indices vector. I think coalescing is probably hopeless for the x (in Ax), so I am now thinking about using texture memory for the vector x, which, if the matrix is renumbered efficiently, should have some locality for adjacent rows…
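A sketch of what that could look like with the texture-reference API (all names hypothetical): one thread per row walks the usual CSR arrays (val, colIdx, rowPtr) and fetches x through the texture cache:

```cuda
// x is bound to a 1D texture on the host with cudaBindTexture, so the
// irregular gathers go through the texture cache.
texture<float, 1, cudaReadModeElementType> xTex;

// Scalar CSR SpMV: thread i computes y[i] = sum_j val[j] * x[colIdx[j]]
// over row i's nonzeros.
__global__ void spmv_csr(const float *val, const int *colIdx,
                         const int *rowPtr, float *y, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * tex1Dfetch(xTex, colIdx[j]);
        y[row] = sum;
    }
}
```

One caveat: with one thread per row, neighboring threads read val and colIdx at unrelated offsets, so those accesses are not actually coalesced; assigning a warp per row (the "vector CSR" layout) is the usual way to get coalescing on the matrix arrays.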

I think I have to give up on reusing shared memory for A, since there is no way A will fit fully in the shared memory on the card. Full coalescence seems to be the best hope…