I’m planning to port a linear-algebra-intensive subroutine to CUDA Fortran. The subroutine performs many matrix multiplications with large dimensions, matrix–vector multiplications, many matrix inversions, one eigenvalue/eigenvector solve, and, at the end, the solution of a huge linear equation system. Currently I’m planning to convert the subroutine entirely to a device kernel to save time on data transfers between host and device. My main concern with this approach is whether the cuBLAS and cuSOLVER libraries are callable from within CUDA Fortran device code, i.e., from inside a kernel.
Of course, I could also keep the subroutine as host code and make the cuBLAS API calls from there, which would be much easier. My main concern with that approach is the overhead of data transfers between host and device. The matrices involved in all operations are on the order of 3000x3000 double complex, so roughly 150 MB each (3000 × 3000 × 16 bytes ≈ 144 MB), and three matrices need to be transferred per matrix multiplication. I don’t know how much transfer overhead that would introduce for this amount of data.
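To make the overhead concrete to myself, I did a quick back-of-envelope estimate. The 12 GB/s figure is just an assumed effective PCIe bandwidth, not a measurement from my system:

```python
# Rough estimate of host<->device transfer cost for one 3000x3000
# double-complex matrix multiplication (A and B in, C out).
# ASSUMPTION: ~12 GB/s effective PCIe bandwidth (hypothetical figure).

N = 3000                 # matrix dimension
BYTES_PER_ELEM = 16      # double complex = two 8-byte doubles
PCIE_BYTES_PER_S = 12e9  # assumed effective PCIe bandwidth

matrix_bytes = N * N * BYTES_PER_ELEM     # size of one matrix
transfer_bytes = 3 * matrix_bytes         # three matrices per multiplication
transfer_s = transfer_bytes / PCIE_BYTES_PER_S

print(f"one matrix: {matrix_bytes / 1e6:.0f} MB")     # one matrix: 144 MB
print(f"three matrices: {transfer_bytes / 1e6:.0f} MB, "
      f"~{transfer_s * 1e3:.0f} ms at 12 GB/s")       # 432 MB, ~36 ms
```

So each multiplication would cost on the order of tens of milliseconds in transfers alone if the matrices round-trip through the host, which is what makes me want to keep everything resident on the device.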
Since I’m new to CUDA Fortran, any suggestions and advice will be appreciated.
Thanks in advance.