Memory space used by cusolverDN_getrf

jo_pink · May 29, 2021, 12:58pm

I solve linear hydrodynamic problems using cuSOLVER on a GTX 1080 Ti. With 11 GB of memory on the card I should be able to solve linear problems with about 36000 unknowns, and I have done so in the past. Recently I discovered that I could go no further than about 24000 unknowns; beyond that I get memory errors.
What I found was that the LWORK variable (the number of memory locations used by the solver ABOVE the size of the matrix being inverted) was very high (m x n). In the past I had much lower LWORK values. I seem to have updated CUDA without noticing that the solver now defaults to its fastest setting, while I need maximum problem size and can accept some speed loss. There seems to be a way to get the solver to minimize memory use through cusolverDnSetAdvOptions, but I have not been able to find examples of how it is used. Can somebody show me some Fortran code examples of the application of this option? Thanks a million.
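
As a rough check on these figures, here is a minimal plain Fortran sketch of the memory arithmetic. It assumes that all 11 GiB of the card are available to the solver, that each single-precision complex element takes 8 bytes, and that the fast algorithm needs a workspace of m x n elements of the same type on top of the matrix. The program and variable names are made up for illustration.

```fortran
program getrf_memory_estimate
  implicit none
  ! Assumption: 11 GiB of device memory, all of it available to the solver.
  integer(8), parameter :: mem_bytes = 11_8 * 1024_8**3
  ! One single-precision complex element occupies 8 bytes.
  integer(8), parameter :: elem_bytes = 8_8
  integer(8) :: n_matrix_only, n_with_workspace

  ! Small workspace: only the n x n matrix has to fit.
  n_matrix_only = int(sqrt(real(mem_bytes / elem_bytes, kind=8)), kind=8)

  ! Workspace of m x n elements: 2 * n * n elements have to fit.
  n_with_workspace = int(sqrt(real(mem_bytes / (2_8 * elem_bytes), kind=8)), kind=8)

  print *, 'largest n, matrix only:             ', n_matrix_only
  print *, 'largest n, matrix + m x n workspace:', n_with_workspace
end program getrf_memory_estimate
```

Under these assumptions the limits come out at roughly 38000 and 27000 unknowns. The observed limits of about 36000 and 24000 are somewhat lower, as expected when other allocations also live on the card.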

Robert_Crovella · May 30, 2021, 4:16am

From the cuSOLVER documentation:

Remark: getrf uses fastest implementation with large workspace of size m*n. The user can choose the legacy implementation with minimal workspace by Getrf and cusolverDnSetAdvOptions(params, CUSOLVERDN_GETRF, CUSOLVER_ALG_1).

You have to use the 64-bit API.

jo_pink · May 30, 2021, 9:07am

Thank you for your quick reply, Robert. I have read that, but as I am not a professional programmer (I am a naval architect solving hydrodynamic problems) it does not help me much. For instance, what is ‘params’, and where do I get that information? What I need is an example in the form of a few lines of Fortran code containing the relevant declarations and statements and showing where the cusolverDnSetAdvOptions statement goes. This is the code as it is now, where
G4 is an nig4 x nig4 matrix of single-precision complex numbers:

      STATUS = CUSOLVERDNCREATE(HANDLE)

      STATUS = CUSOLVERDNCGETRF_BUFFERSIZE(HANDLE,NIG4,NIG4,
     *  G4,NIG4,LWORK)
      PRINT *, 'BUFFER SIZE STATUS ', STATUS
      PRINT*,'LWORK',LWORK
      ALLOCATE(WORKSPACE_D(LWORK))

!$ACC DATA CREATE(WORKSPACE_D, DEVIPIV, DEVINFO)
!$ACC HOST_DATA USE_DEVICE(DEVIPIV, WORKSPACE_D, DEVINFO)
      STATUS = CUSOLVERDNCGETRF(HANDLE, NIG4, NIG4, G4, NIG4,
     * WORKSPACE_D, DEVIPIV, DEVINFO)
      STATUS = CUSOLVERDNCGETRS(HANDLE, CUBLAS_OP_N, NIG4,
     * NRHS, G4, NIG4, DEVIPIV, RL, NIG4, DEVINFO)
!     PRINT*,'CUSOLVERDNCGETRS STATUS = ',STATUS
!$ACC END HOST_DATA
!$ACC END DATA
      STATUS = CUSOLVERDNDESTROY(HANDLE)
      PRINT*,'CUSOLVERDNDESTROY STATUS = ',STATUS

Elsewhere I have a module containing interfaces for cusolverDnCreate, cusolverDnDestroy, cusolverDnCgetrf_bufferSize, cusolverDnCgetrf, cusolverDnCgetrs and get_dev_mem.

This module was put together based on the kind directions of the engineers from PGI. All works fine except for the present problem of the large memory requirement for cusolverDnCgetrf.

Hope this helps,

Regards,

Jo

jo_pink · July 5, 2021, 5:31pm

Hi all,

I managed to overcome the memory problem outlined above by swapping a pair of DLLs from CUDA 10.1 for the DLLs with the same name from CUDA 11.2, which I have been using.
It concerns the following DLLs:
cudart64_101.dll
cusolver64_10.dll

Using the DLLs from CUDA 10.1 brought the available memory right back up, and the LWORK value dropped to 14 instead of m x n, the size of the matrix going into cusolverDnCgetrf.
I hope that NVIDIA includes the option to easily swap the fast solver for the same solver that is a bit slower but makes more memory available for larger problems.

Regards and thanks,

Jo

mnicely · July 6, 2021, 12:42pm

cuSOLVER does provide the option to use the legacy algorithms.
Per Robert’s instructions above, you must pass the correct parameter through cusolverDnSetAdvOptions.

jo_pink · July 6, 2021, 3:03pm

Hi mnicely,

Thanks for your comment. I am aware that cusolverDnSetAdvOptions is the correct way to go but, as I mentioned in a previous note (above), I have problems filling in the proper parameters. I tried a few times but got compile errors, so I gave that up and gave preference to the ‘fix’ shown above. This has happened to me before: I post a question and get a quick answer, but it is often clearly aimed at professionals who know what to do to make it work. I need short bits of code which will do the job and show the relationship between the various calls. In this case the call cusolverDnSetAdvOptions(params, function, algo) needs to be completed, the parameters declared, and the call put in the proper location relative to the call to cusolverDnCgetrf(…). What I clearly miss in the NVIDIA documentation are bits of code showing practical applications of such calls. If you can recommend some suitable literature I would be grateful. I have the book ‘Parallel Programming with OpenACC’, which has a lot of examples, but unfortunately it is almost all in C++ with little in Fortran. This particular problem is not treated in the book either.

If you can help me along that would be great,
Best regards and thanks,

Jo

mnicely · July 6, 2021, 3:11pm

Best book for beginners - https://developer.download.nvidia.com/books/cuda-by-example/cuda-by-example-sample.pdf

I’ll see if I can’t find a Fortran example using cusolverDnSetAdvOptions

jo_pink · July 9, 2021, 10:39am

Thanks for the references!

The Fortran examples should help me out.

Thanks,

Jo

bleback · July 9, 2021, 4:35pm

Hi Jo, here is a CUDA Fortran version. It could be easily modified for OpenACC or other models.

program testdgetrf
use cutensorex
use cusolverDn
use cudafor
integer, parameter :: M = 8000
integer, parameter :: N = M
real(8), parameter :: eps = 1.0d-7
real(8), managed :: a(M,N), b(M,128)
real(8), managed :: bscal(128)
real(8) adiff, bval, t0, t1
integer(8), managed :: ipiv(M)
integer(4), managed :: fsinfo(2)
integer(4) lda
integer(8) devsz, hostsz
type(cusolverDnHandle) :: h
type(cusolverDnParams) :: p
real(8), device, allocatable :: dwork(:)
real(8), allocatable :: hwork(:)

call random_number(a)
!$cuf kernel do(2) <<< *,* >>>
do j = 1, n
  do i = 1, n
    if (abs(i-j) .lt. 4) then
      a(i,j) = a(i,j) * 10.0d0
    end if
  end do
end do
!
b(:,1) = sum(a,dim=2)
call random_number(bscal)
!$cuf kernel do(2)<<< *,* >>>
do j = 2, 128
  do i = 1, M
    b(i,j) = b(i,1) * bscal(j)
  end do
end do
!
lda = m

istat = cusolverDnCreate(h)
print *,"cusolver handle create status = ",istat

istat = cusolverDnCreateParams(p)
print *,"cusolver create params status = ",istat

istat = cusolverDnSetAdvOptions(p, CUSOLVERDN_GETRF, CUSOLVER_ALG_1)
print *,"cusolver set adv options status = ",istat

istat = cusolverDnXgetrf_buffersize(h, p, m, n, cudaDataType(CUDA_R_64F), &
            a, lda, cudaDataType(CUDA_R_64F), devsz, hostsz )

print *,"cusolver Xgetrf buffersize status = ",istat
print *,"cusolver Xgetrf buffersize dev, host size = ", devsz, hostsz

! devsz and hostsz are byte counts; round up so the real(8) buffers are never too small
allocate(dwork((devsz+7)/8))
allocate(hwork((hostsz+7)/8))

call cpu_time(t0)
istat = cusolverDnXgetrf(h, p, m, n, cudaDataType(CUDA_R_64F), a, lda, &
            ipiv, cudaDataType(CUDA_R_64F), dwork, devsz, hwork, hostsz, fsinfo(1))

jstat = cusolverDnXgetrs(h, p, CUBLAS_OP_N, n, 128, cudaDataType(CUDA_R_64F), &
            a, lda, ipiv, cudaDataType(CUDA_R_64F), b, lda, fsinfo(2))
ksync = cudaDeviceSynchronize()   ! separate variable, so istat still holds the getrf status printed below
call cpu_time(t1)
print *,"dgetrf return",istat, fsinfo(1)
print *,"dgetrs return",jstat, fsinfo(2)
!
nerrors = 0
bscal(1) = 1.0
do i = 1, 128
  adiff = abs(minval(b(:,i))-bscal(i)) + abs(maxval(b(:,i))-bscal(i))
  if (adiff.gt.eps) then
    nerrors = nerrors + 1
    write (6,100) minval(b(:,i)), maxval(b(:,i)), bscal(i)
  endif
end do
100 format(10(1x,f12.8))
if (nerrors.eq.0) print *,"test PASSED"
print *,"Time for dgetrf and dgetrs using cpu_time: ",t1-t0
end

I compiled it with 21.5 like this: nvfortran tdgetrf.cuf -cudalib=cutensor,curand,cusolver
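
For the single-precision complex case raised earlier in the thread, the same sequence of calls might look like the sketch below. It has not been tested. It assumes the generic interfaces in the cusolverDn module accept complex(4) arrays when the data type is given as CUDA_C_32F, in the same way they accept real(8) arrays with CUDA_R_64F in the program above. The program name, the problem size and the test matrix are made up for illustration; every cuSOLVER call and constant apart from CUDA_C_32F is taken from the program above.

```fortran
program testcgetrf
use cusolverDn
use cudafor
implicit none
integer, parameter :: n = 1000        ! hypothetical problem size
integer, parameter :: nrhs = 1
complex(4), managed :: a(n,n), b(n,nrhs)
integer(8), managed :: ipiv(n)
integer(4), managed :: info(2)
real(4), allocatable :: re(:,:), im(:,:)
integer(8) :: devsz, hostsz
integer(4) :: lda, istat, i
type(cusolverDnHandle) :: h
type(cusolverDnParams) :: p
real(8), device, allocatable :: dwork(:)
real(8), allocatable :: hwork(:)

! Diagonally dominant test matrix; b holds the row sums, so the solution is all ones.
allocate(re(n,n), im(n,n))
call random_number(re)
call random_number(im)
a = cmplx(re, im, kind=4)
do i = 1, n
  a(i,i) = a(i,i) + real(2*n, 4)
end do
b(:,1) = sum(a, dim=2)
lda = n

istat = cusolverDnCreate(h)
istat = cusolverDnCreateParams(p)

! Select the legacy getrf algorithm with the small workspace.
! This must come before the buffer size query.
istat = cusolverDnSetAdvOptions(p, CUSOLVERDN_GETRF, CUSOLVER_ALG_1)
print *, "set adv options status = ", istat

istat = cusolverDnXgetrf_buffersize(h, p, n, n, cudaDataType(CUDA_C_32F), &
            a, lda, cudaDataType(CUDA_C_32F), devsz, hostsz)
print *, "workspace bytes (device, host) = ", devsz, hostsz

! Sizes are in bytes; round up to whole real(8) words.
allocate(dwork((devsz+7)/8))
allocate(hwork((hostsz+7)/8))

istat = cusolverDnXgetrf(h, p, n, n, cudaDataType(CUDA_C_32F), a, lda, &
            ipiv, cudaDataType(CUDA_C_32F), dwork, devsz, hwork, hostsz, info(1))
print *, "getrf status = ", istat

istat = cusolverDnXgetrs(h, p, CUBLAS_OP_N, n, nrhs, cudaDataType(CUDA_C_32F), &
            a, lda, ipiv, cudaDataType(CUDA_C_32F), b, lda, info(2))
print *, "getrs status = ", istat

istat = cudaDeviceSynchronize()
print *, "getrf info, getrs info = ", info(1), info(2)
print *, "max |x - 1| = ", maxval(abs(b(:,1) - (1.0, 0.0)))
end program testcgetrf
```

The point to take from it is the order of the calls: create the params object, set CUSOLVER_ALG_1 on it, and only then query the buffer size and call the 64-bit Xgetrf and Xgetrs routines with that same params object. It can presumably be built the same way as the program above, with cusolver in the -cudalib list.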

jo_pink · July 9, 2021, 7:35pm

Thanks, bleback, for the sample code!
I will do my best to get this working for my case and will get back when I have worked the whole thing out. I won’t make a guess as to how long that will take.

Thanks again,

Jo
