Suggestions and advice on porting a linear-algebra-intensive subroutine to CUDA

Hi, all:

I’m planning to port a linear-algebra-intensive subroutine/algorithm to CUDA Fortran. The subroutine has many matrix multiplications with large dimensions, matrix-vector multiplications, many matrix inversions, one eigenvalue/eigenvector solve, and the solution of a huge linear system at the end. Currently I’m planning to convert the subroutine entirely to a device kernel to save time on data transfers between host and device. My main concern with this approach is whether the cuBLAS and cuSOLVER libraries are callable from within CUDA Fortran device code/kernels.

Of course, I could also keep the subroutine as host code and make cuBLAS API calls from it, which would be much easier. My main concern with that approach is the overhead of data transfers between host and device. The matrices involved in all operations are on the order of 3000x3000 double complex, so at 16 bytes per element each matrix is roughly 9×10^6 × 16 bytes ≈ 144 MB, and a matrix multiplication needs three such matrices transferred. I don’t know how much transfer overhead that amount of data would introduce.

Since I’m new to CUDA Fortran, any suggestions and advice will be appreciated.

Thanks in advance.

John

Hi John,

You don’t have to move data back and forth with each call. Both cuBLAS and cuSOLVER operate on data that already resides on the GPU and leave it there when the call completes. You should only have to move the data to the device once at the beginning and bring the results back once at the end.
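The pattern looks roughly like this (a minimal sketch assuming nvfortran’s cublas module, which overloads the standard BLAS names for device arrays; the size and the fill-in steps are placeholders):

    ! Compile with something like: nvfortran -cuda -cudalib=cublas gemm.f90
    program gemm_on_device
      use cudafor
      use cublas          ! device-array overloads of the BLAS names
      implicit none
      integer, parameter :: n = 3000
      complex(8), allocatable         :: a(:,:), b(:,:), c(:,:)
      complex(8), device, allocatable :: a_d(:,:), b_d(:,:), c_d(:,:)

      allocate(a(n,n), b(n,n), c(n,n))
      allocate(a_d(n,n), b_d(n,n), c_d(n,n))
      ! ... fill a and b on the host ...

      a_d = a   ! host -> device, once at the beginning
      b_d = b

      ! Runs on the GPU; the result stays in c_d for later calls
      call zgemm('N', 'N', n, n, n, (1.0d0,0.0d0), a_d, n, b_d, n, &
                 (0.0d0,0.0d0), c_d, n)

      ! ... further cublas/cusolver calls can reuse a_d, b_d, c_d ...

      c = c_d   ! device -> host, once at the end
    end program gemm_on_device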

Creating one big kernel to do it all has a lot of drawbacks. Most of the math-library functionality you want is not currently available on the device. And you would spend a lot of resources passing arguments, doing small scalar operations, etc., which are better done on the host.

  • Brent

Hi, Brent:

Thanks for your suggestion. This morning I also realized that I can leave the data on the device, make CUDA library API calls on it, and use CUDA Fortran loop directives (CUF kernels) or OpenACC to parallelize the rest of my code, along the lines of the sketch below. I will go with this approach.
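Something like this is what I have in mind (just a sketch; scale_columns and its arguments are made-up placeholders for one of the element-wise steps between library calls):

    ! A CUF kernel directive runs the loop nest on the GPU over device
    ! arrays, so it composes with cublas/cusolver calls without any
    ! extra host-device transfers.
    subroutine scale_columns(c_d, s_d, n)
      use cudafor
      implicit none
      integer, intent(in) :: n
      complex(8), device  :: c_d(n,n)
      real(8), device     :: s_d(n)
      integer :: i, j

      !$cuf kernel do(2) <<<*,*>>>
      do j = 1, n
        do i = 1, n
          c_d(i,j) = c_d(i,j) * s_d(j)
        end do
      end do
    end subroutine scale_columns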

Thanks again for your advice.

John