I am getting into porting code to the GPU using the OpenACC paradigm. Often my problems contain many calls to LAPACK from the GPU. An example code snippet is:
! data copy statements here
!$acc parallel loop
A = …
b = …
call dgesv( … )
!$acc end parallel loop
How should such a call to a LAPACK routine be handled (i.e., from the GPU)? I have seen that MAGMA can do this, but is that really the right way? Isn't it meant to solve a few very large problems rather than many (~10^7) small problems (of size on the order of 4 to 24 or so)?
Your help is appreciated.
It is technically possible, but there are some details that may or may not make it practical.
Are your calls to dgesv independent, i.e. does a given loop iteration not depend on the previous iteration? If each iteration depends on the previous one, the loop has to run serially.
That issue aside, you could call the cuSolver library's gesv (cusolverDnDDgesv). The data would first have to be moved to the GPU, or allocated there; presumably you'd add the appropriate OpenACC directives to do this. There is a blog post about how to access the device addresses to pass to the cuSolver function (here).
cuSolver has a C API; one could use Fortran's ISO C binding feature (iso_c_binding) to call it.
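For illustration, here is a rough sketch of what the binding and call could look like (the module and variable names are hypothetical; handle creation, workspace sizing via cusolverDnDDgesv_bufferSize, and error checking are all omitted). With the data already on the device, OpenACC's host_data use_device clause exposes the device addresses so that c_loc picks them up:

```fortran
module cusolver_bindings          ! hypothetical module name
  use iso_c_binding
  implicit none
  interface
     ! Simplified binding for cusolverDnDDgesv; see the cuSolver docs
     ! for the authoritative signature.
     integer(c_int) function cusolverDnDDgesv(handle, n, nrhs,           &
          dA, ldda, dipiv, dB, lddb, dX, lddx,                           &
          dWorkspace, lwork_bytes, niter, dinfo)                         &
          bind(c, name='cusolverDnDDgesv')
       import :: c_ptr, c_int, c_size_t
       type(c_ptr),       value :: handle           ! cusolverDnHandle_t
       integer(c_int),    value :: n, nrhs, ldda, lddb, lddx
       type(c_ptr),       value :: dA, dipiv, dB, dX, dWorkspace  ! device pointers
       integer(c_size_t), value :: lwork_bytes
       integer(c_int)           :: niter            ! host output
       type(c_ptr),       value :: dinfo            ! device output
     end function
  end interface
end module

! At the call site (A, ipiv, b, x, work, info_d declared with the
! target attribute and already present on the device):
!$acc host_data use_device(A, ipiv, b, x, work, info_d)
istat = cusolverDnDDgesv(handle, n, 1, c_loc(A), n, c_loc(ipiv), &
                         c_loc(b), n, c_loc(x), n,               &
                         c_loc(work), lwork_bytes, niter, c_loc(info_d))
!$acc end host_data
```

Inside host_data use_device the array names refer to their device copies, which is what lets plain c_loc produce device addresses to hand to the C API.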
In terms of getting good performance on many small operations, one might use CUDA streams to overlap them. Presumably you'd need to create some number of streams and cycle through them, telling cuSolver which stream to use at each loop iteration. This may require calling the CUDA API; I'm not sure whether OpenACC can create streams. Also of interest: for some BLAS functions, NVIDIA has "batched" versions that improve performance when doing many small operations (here). This may give you some other ideas about the issue.
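On the streams point: OpenACC async queues are backed by CUDA streams on NVIDIA targets, and (at least in the NVIDIA/PGI runtime) acc_get_cuda_stream returns the stream behind a given queue, which can be handed to cuSolver with cusolverDnSetStream. A rough sketch, assuming acc_get_cuda_stream is available from the openacc module and a cusolverDnSetStream binding exists:

```fortran
use openacc
use iso_c_binding
! Assumed: an iso_c_binding interface for
!   cusolverStatus_t cusolverDnSetStream(cusolverDnHandle_t, cudaStream_t)
integer, parameter :: nstreams = 8
integer :: i, q, istat
type(c_ptr) :: stream

do i = 1, nproblems
   q = mod(i - 1, nstreams) + 1        ! cycle through async queues 1..nstreams
   stream = acc_get_cuda_stream(q)     ! CUDA stream backing OpenACC queue q
   istat = cusolverDnSetStream(handle, stream)
   ! ... enqueue the i-th small solve on this stream ...
end do
```

For the batched route, cuBLAS provides routines such as cublasDgetrfBatched / cublasDgetrsBatched, which factor and solve many small systems in a single call; for 10^7 systems of size 4 to 24, that is likely a better fit than one stream per solve.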
Disclaimer: I've done a fair amount of CUDA programming, but I've never used OpenACC and rarely touch Fortran, so my advice may not be optimal. Nonetheless,
hope this helps.
Thank you Burlen,
Indeed the solves are independent.
All data is on the GPU already. It’s only the calling that’s the issue.
I’ll look into cuSolver.
An alternative is to just take the LAPACK source code and compile it in, possibly inlining all the routines.
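If I try that, I expect the mechanism would be OpenACC's routine directive: marking the LAPACK routines with !$acc routine seq so each GPU thread runs its own sequential solve inside the parallel loop. A rough sketch (this requires annotating and compiling the whole call chain — dgesv, dgetrf, dgetrs, and so on — and assumes the code avoids features the device compiler rejects, such as I/O):

```fortran
! In the LAPACK source being compiled in:
subroutine dgesv(n, nrhs, a, lda, ipiv, b, ldb, info)
!$acc routine seq
  ...
end subroutine

! At the call site, one small solve per loop iteration/thread:
!$acc parallel loop private(info)
do i = 1, nproblems
   call dgesv(n, 1, A(:,:,i), n, ipiv(:,i), b(:,i), n, info)
end do
```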
I’ll also have a look at the batching.
What I find surprising is that there is no quick fix for this, since it seems to be an issue faced by many.