Using MPI/OpenMP with GPU for batched solution of multiple linear systems of equations

I would like to solve multiple (~4-8) linear systems of equations on a single GPU, with each system managed by a single OpenMP thread. The threads are spawned within the context of a higher-level MPI routine, so common routines using ScaLAPACK, for example, are not an option. The problem sizes I am looking at are on the order of 100 and above, so a naive implementation with dense linear algebra libraries slows execution down by a factor of 5-10.

With CUDA 5 and “Hyper-Q”, my hope was that I would see a large improvement, since the GPU would now have a sufficiently large amount of computation to perform. However, this does not appear to be the case.

Are there existing linear algebra libraries which take advantage of shared memory/single GPU systems? The batched getrf functionality recently released in the developer zone would be perfect, but I am not sure how to merge this with multiple CPU threads. I would like to be able to delegate the work to be done to the GPU, then perform other operations with each CPU thread, if this is even possible.

It seems my biggest challenge is managing the communication/computation between threads and the GPU with OpenMP/CUDA. I have not been able to find any documentation on similar efforts, so if anyone could point me in that direction, I would be eternally grateful.
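To make the intended pattern concrete, here is a sketch of what I have in mind, assuming the cuBLAS batched interface (`cublasDgetrfBatched`) with one CUDA stream and one cuBLAS handle per OpenMP thread, each thread submitting a batch of one system. Error checking and the back-substitution step are omitted, and `hostA` is a hypothetical array of per-thread matrices:

```c
// Sketch only - per-thread stream + handle so launches from different
// OpenMP threads can overlap via Hyper-Q. Requires CUDA and cuBLAS.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <omp.h>

void solve_systems(double **hostA, int n, int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();

        cudaStream_t stream;
        cublasHandle_t handle;
        cudaStreamCreate(&stream);
        cublasCreate(&handle);
        cublasSetStream(handle, stream);

        // Device copy of this thread's matrix, plus the device-side
        // array of matrix pointers the batched interface expects.
        double *dA;
        double **dAarray;
        int *dPivot, *dInfo;
        cudaMalloc((void **)&dA, n * n * sizeof(double));
        cudaMalloc((void **)&dAarray, sizeof(double *));
        cudaMalloc((void **)&dPivot, n * sizeof(int));
        cudaMalloc((void **)&dInfo, sizeof(int));
        cudaMemcpyAsync(dA, hostA[tid], n * n * sizeof(double),
                        cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dAarray, &dA, sizeof(double *),
                        cudaMemcpyHostToDevice, stream);

        // Asynchronous launch: control returns to the CPU thread here,
        // batch size 1 per thread.
        cublasDgetrfBatched(handle, n, dAarray, n, dPivot, dInfo, 1);

        // ... other CPU work on this thread while the GPU factorizes ...

        cudaStreamSynchronize(stream); // wait before using the result

        cublasDestroy(handle);
        cudaStreamDestroy(stream);
        cudaFree(dA); cudaFree(dAarray);
        cudaFree(dPivot); cudaFree(dInfo);
    }
}
```

Is this roughly the right way to interleave CPU work with the GPU factorizations, or is there a better-supported pattern?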

Which libraries are you planning to use? Currently the only one that might benefit from
Dynamic Parallelism/Hyper-Q/MPI is NVIDIA’s cuBLAS, since it supports streams and is asynchronous.
The others I checked just very recently - MAGMA and CULA - support streams poorly or not at all,
and worse, they do a lot of ping-pong with the CPU, which makes them unsafe for
multi-threading/streams on the CPU side as well.

I think that two things should be done first:

  • Find out whether a single problem already fully utilizes the GPU, in which case there would
    be no performance gain from running multiple solvers concurrently.

  • Run the profiler to see whether your code, or the library code you intend to use, would be
    problematic when combined with Hyper-Q/MPI/…

    And if you find interesting stuff, it would be great if you post your findings :)


I’m not sure this problem is large enough to load the GPU to a reasonable degree.
Since you mention Hyper-Q, I assume you have a GK110 device with 13 or 14 SMX. For a problem as small as 100x100, I agree the batched solver would be the best option, because it is tuned for small problem sizes - even if it is called with just a single system to solve per OpenMP thread. The parallelism then comes from overlapping the kernels launched by the different OpenMP threads.
With a single block active for each of 4-8 OpenMP threads, only about a quarter to a half of the SMXes are used at all, and each of those is underutilized with only a single block running.

As reasonable expectations for GPU vs. CPU speedup are in the 5x-20x range, you are then down to about 1x-4x in the best case. That might still be a win if your CPU is doing other useful work in the meantime, but it could easily be improved on if you had more than 4-8 systems to solve.