Using MPI/OpenMP with GPU for batched solution of multiple linear systems of equations

I would like to solve multiple (~4-8) linear systems of equations using a single GPU, with each system managed by its own OpenMP thread. The threads are spawned within the context of a higher-level MPI routine, so common approaches using ScaLAPACK, for example, are not an option. The problem sizes I am looking at are of order 100, so a naive implementation on top of dense GPU linear algebra libraries actually slows the execution down by a factor of 5-10.

With CUDA 5 and “Hyper-Q”, it was my hope that I would see a large improvement, by virtue of the GPU now having a sufficiently large amount of computation to perform. However, this does not appear to be the case.

Are there existing linear algebra libraries which take advantage of shared-memory/single-GPU systems? The batched getrf functionality recently released in the developer zone would be perfect, but I am not sure how to combine it with multiple CPU threads. I would like to be able to delegate work to the GPU and then perform other operations with each CPU thread while the GPU runs, if that is even possible.
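For concreteness, the batched routine I have in mind is the LU factorization interface exposed in cuBLAS (cublasDgetrfBatched); if I read the documentation correctly, a single call from one thread would look roughly like this (the array names are mine, and the array of pointers, pivots and info flags all live in device memory):

```c
#include <cublas_v2.h>

/* Factor 'batch' independent n-by-n column-major systems in one call.
 * d_Aptrs  : device array of 'batch' device pointers, one per matrix
 *            (the LU factors overwrite the matrices in place)
 * d_pivots : device array of n*batch ints for the pivot indices
 * d_info   : device array of 'batch' ints, one status flag per system */
void factor_batch(cublasHandle_t handle, int n, double **d_Aptrs,
                  int *d_pivots, int *d_info, int batch)
{
    /* The call only enqueues work on the stream bound to 'handle'
     * and returns to the host immediately. */
    cublasDgetrfBatched(handle, n, d_Aptrs, n, d_pivots, d_info, batch);
}
```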

It seems my biggest challenge is managing the communication/computation between threads and the GPU with OpenMP/CUDA. I have not been able to find any documentation on similar efforts, so if anyone could point me in that direction, I would be eternally grateful.
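To make the question a bit more concrete, here is roughly the orchestration pattern I have in mind, with one cuBLAS handle and one stream per OpenMP thread (untested sketch; launch_my_gpu_solve and do_other_cpu_work are just placeholders for each thread's GPU launch and CPU-side work):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <omp.h>

/* Placeholders for this sketch: enqueue one thread's GPU work
 * (e.g. a batched getrf as above) and do useful CPU work meanwhile. */
static void launch_my_gpu_solve(cublasHandle_t handle, cudaStream_t stream)
{
    (void)handle; (void)stream;
}
static void do_other_cpu_work(void) { }

void solve_all_systems(int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    {
        /* One handle and one stream per thread, so each thread's GPU
         * work goes into its own queue and Hyper-Q can overlap them. */
        cudaStream_t stream;
        cublasHandle_t handle;
        cudaStreamCreate(&stream);
        cublasCreate(&handle);
        cublasSetStream(handle, stream);

        /* Enqueue the GPU work asynchronously, then keep the CPU busy. */
        launch_my_gpu_solve(handle, stream);
        do_other_cpu_work();

        /* Block only when this thread actually needs its results. */
        cudaStreamSynchronize(stream);

        cublasDestroy(handle);
        cudaStreamDestroy(stream);
    }
}
```

My assumption is that as long as each thread sticks to its own handle and stream, the cuBLAS calls are safe to issue concurrently, but I have not verified this.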

Hi,
Which libraries are you planning to use? Currently the only one that might show a benefit from Dynamic Parallelism/Hyper-Q/MPI would be NVIDIA’s cuBLAS, as it supports streams and is asynchronous. The others I’ve checked just very recently - MAGMA and CULA - support streams poorly or not at all, and even worse, they do a lot of ping-pong with the CPU, which makes them unsafe for multi-threading/streams on the CPU side as well.

I think that two things should be done first:

  • Check whether the problem size already fully utilizes the GPU on its own, in which case there
    would be no performance gain from running multiple solvers concurrently.

  • Run the profiler to see whether your code, or the library code you intend to use, runs into
    problems when combined with Hyper-Q/MPI/… (see the nvprof example after this list).

    And if you find interesting stuff, it would be great if you post your findings :)
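For the profiling part, something along these lines (assuming the nvprof tool that ships with the CUDA 5 toolkit; the executable name is just a placeholder) prints a per-kernel GPU trace with stream IDs, which makes it easy to check whether kernels issued from different threads/streams actually overlap:

```
nvprof --print-gpu-trace ./my_batched_solver
```

The Visual Profiler (nvvp) timeline shows the same information graphically.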

Eyal

I’m not sure whether this problem is sufficiently large to load a GPU to a reasonable degree.
Since you are mentioning Hyper-Q, I assume you have a GK110 device with 13 or 14 SMXs. For a problem as small as 100x100 I agree the batched solver would be the best solution, because it is tuned for small problem sizes, even if it’s called with just a single system to solve from each OpenMP thread. The parallelism would then come from overlapping the kernels launched by the different OpenMP threads.
With a single block active for each of 4-8 OpenMP threads, only about a quarter to a half of the SMXs are used at all, and each of those is underutilized with only a single block running.

As reasonable expectations for GPU vs. CPU speedup are in the 5x-20x range, you are then down to about 1x to 4x in the best case, which might still be a win if your CPU is doing other useful work in the meantime, but could easily be improved upon if you had more than 4-8 systems to solve.