I would like to solve multiple (~4-8) linear systems of equations on a single GPU, with each system managed by a single OpenMP thread. The threads are spawned within the context of a higher-level MPI routine, so common approaches built on ScaLAPACK, for example, are not an option. The problem sizes I am looking at are on the order of 100, so a naive implementation with dense linear algebra libraries slows execution down by a factor of 5-10.
With CUDA 5 and “Hyper-Q”, it was my hope that I would see a large improvement, by virtue of the GPU now having a sufficiently large amount of computation to perform. However, this does not appear to be the case.
Are there existing linear algebra libraries that take advantage of shared-memory/single-GPU systems? The batched getrf functionality recently released in the developer zone would be perfect, but I am not sure how to combine it with multiple CPU threads. I would like to be able to delegate the work to the GPU, then perform other operations with each CPU thread while the GPU runs, if that is even possible.
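For reference, my rough mental model of how the batched route might look is the sketch below. I am assuming the cuBLAS batched interface (`cublasDgetrfBatched`, which takes a device array of device pointers to the column-major matrices), and that attaching the handle to a stream makes the call asynchronous with respect to the host, so the CPU threads could go off and do other work before synchronizing. Error checking is omitted, and the pointer setup is assumed to have happened already.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: factor 'batch' dense n-by-n systems in one batched cuBLAS call.
// d_Aptrs is a DEVICE array of device pointers; d_Aptrs[i] points to the
// i-th column-major matrix, already resident in GPU memory.
void batched_lu(cublasHandle_t handle, int n, double **d_Aptrs,
                int *d_pivots, int *d_info, int batch, cudaStream_t stream)
{
    // Attach the batched call to a stream so it is queued asynchronously;
    // the host returns immediately and the CPU threads can keep working.
    cublasSetStream(handle, stream);
    cublasDgetrfBatched(handle, n, d_Aptrs, n, d_pivots, d_info, batch);

    // ... later, only when the factors are actually needed:
    // cudaStreamSynchronize(stream);
}
```

The part I cannot picture is the gather step: each OpenMP thread owns one system, so presumably the threads would have to deposit their matrices into the shared batch, hit a barrier, and have a single thread issue the batched call.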
It seems my biggest challenge is managing the communication/computation between threads and the GPU with OpenMP/CUDA. I have not been able to find any documentation on similar efforts, so if anyone could point me in that direction, I would be eternally grateful.
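To make the question concrete, this is the thread/GPU interaction pattern I am trying to achieve (a sketch under my assumptions, not working code): each OpenMP thread owns its own stream and cuBLAS handle, enqueues its transfers and GPU work asynchronously, does CPU work in the meantime, and synchronizes only when the GPU result is needed. My understanding is that Hyper-Q is what should allow these per-thread streams to actually execute concurrently on the device.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <omp.h>

// Sketch: one stream + one cuBLAS handle per OpenMP thread, with
// asynchronous transfers so CPU work can overlap the GPU work.
void solve_per_thread(int n)
{
    #pragma omp parallel
    {
        cudaStream_t stream;
        cublasHandle_t handle;
        cudaStreamCreate(&stream);
        cublasCreate(&handle);
        cublasSetStream(handle, stream);   // all work goes to this stream

        double *h_A, *d_A;                 // this thread's system
        // Pinned host memory is required for cudaMemcpyAsync to
        // actually overlap with computation.
        cudaMallocHost(&h_A, n * n * sizeof(double));
        cudaMalloc(&d_A, n * n * sizeof(double));
        // ... fill h_A with this thread's matrix ...

        cudaMemcpyAsync(d_A, h_A, n * n * sizeof(double),
                        cudaMemcpyHostToDevice, stream);
        // ... enqueue the factorization/solve on 'stream' here ...

        // Desired overlap: this thread does other CPU work while
        // the GPU processes its queued stream.

        cudaStreamSynchronize(stream);     // block only when results are needed

        cudaFree(d_A);
        cudaFreeHost(h_A);
        cublasDestroy(handle);
        cudaStreamDestroy(stream);
    }
}
```

If this pattern is fundamentally the wrong way to drive a single GPU from several OpenMP threads, I would appreciate being told so.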