[cuSOLVER] ERROR using cuSolverDnXtrtri in FORTRAN

Hi, I’m trying to convert a call to the LAPACK routine Dtrtri into cusolverDnXtrtri to utilize the GPU. As I learned from an old topic, the “X” version is the only one available from CUDA 11.4 onwards (I’m using CUDA 11.8).

The issue is that I keep getting an error status returned from the function. According to the documentation, a status of info = i should indicate that A(i,i) is 0, but I’m sure it is not.

I believe the problem lies in the datatype argument (I struggled a lot to satisfy the compiler with it).

Here is an example of how I use the function:

  use cublas
  use cusolverdn
  use cudafor

  implicit none

  type(cusolverDnHandle)        :: handle
  integer                       :: n,istat
  integer(8)                    :: dlwork, hlwork
  integer(1),allocatable        :: hwork(:)
  integer(1),allocatable,device :: dwork(:)
  real(8),               device :: C(n,n)
  integer,               device :: istat_d


ISTAT=CUSOLVERDNXTRTRI_BUFFERSIZE(HANDLE,CUBLAS_FILL_MODE_UPPER,CUBLAS_DIAG_NON_UNIT,N,&
                                 &CUDADATATYPE(CUDA_R_64F),C,N,DLWORK,HLWORK)

ALLOCATE(DWORK(DLWORK),STAT=ISTAT)
ALLOCATE(HWORK(HLWORK),STAT=ISTAT)

ISTAT=CUSOLVERDNXTRTRI(HANDLE,CUBLAS_FILL_MODE_UPPER,CUBLAS_DIAG_NON_UNIT,N,&
                      &CUDADATATYPE(CUDA_R_64F),C,N,DWORK,DLWORK,HWORK,HLWORK,ISTAT_d)

It is a little confusing, and I admit our use of the cudaDataType is not consistent in the libraries. It is on our list to clean that up.

Here, I think, is a working program for cusolverDnXtrtri from Fortran. It uses real(4), but the same approach works for real(8) with a change of data types. Yes, the function does return 0 here, which, on rereading the documentation, indicates success.

program main
use cudafor
use cusolverDn

integer, parameter :: n = 8
integer :: istat, lda
type(cusolverDnHandle) :: handle
real(4), device, allocatable, dimension(:,:) :: A
real(4) :: ha(n,n), hi(n,n), hh(n,n)
integer(4), device :: devinfo
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), allocatable, device :: bufferOnDevice(:)
integer(1), allocatable :: bufferOnHost(:)

lda = n
allocate(a(lda,n))
! Fill an upper-triangular test matrix on the device
!$cuf kernel do(2) <<<,>>>
do j = 1, n
  do i = 1, n
    a(i,j) = 0.0
    if (i .eq. j) a(i,j) = 100.0
    if (j .gt. i) a(i,j) = 1.0
  end do
end do

ha = a
istat = cusolverDnCreate(handle)
if (istat .ne. CUSOLVER_STATUS_SUCCESS) then
  print *, 'handle creation failed'
  print *, 'Test FAILED'
endif

istat = cusolverDnXtrtri_bufferSize(handle, CUBLAS_FILL_MODE_UPPER, &
        CUBLAS_DIAG_NON_UNIT, n, cudaDataType(CUDA_R_32F), A, lda, &
        workspaceInBytesOnDevice, workspaceInBytesOnHost)
if (istat .ne. CUSOLVER_STATUS_SUCCESS) then
  print *, 'cusolverDnXtrtri_bufferSize failed'
  print *, 'Test FAILED'
endif

print *, 'workspaceInBytesOnDevice: ', workspaceInBytesOnDevice
if (workspaceInBytesOnDevice .le. 0) workspaceInBytesOnDevice = 1
allocate(bufferOnDevice(workspaceInBytesOnDevice))
print *, 'workspaceInBytesOnHost: ', workspaceInBytesOnHost
if (workspaceInBytesOnHost .le. 0) workspaceInBytesOnHost = 1
allocate(bufferOnHost(workspaceInBytesOnHost))

istat = cusolverDnXtrtri(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_DIAG_NON_UNIT, &
        n, cudaDataType(CUDA_R_32F), A, lda, bufferOnDevice, &
        workspaceInBytesOnDevice, bufferOnHost, workspaceInBytesOnHost, devinfo)
if (istat .ne. CUSOLVER_STATUS_SUCCESS) then
  print *, 'cusolverDnXtrtri failed'
  print *, 'Test FAILED'
endif

istat = cusolverDnDestroy(handle)
if (istat .ne. CUSOLVER_STATUS_SUCCESS) then
  print *, 'handle destruction failed'
  print *, 'Test FAILED'
endif

! devinfo is LAPACK-style: 0 means success, i > 0 means A(i,i) is zero
istat = devinfo
if (istat .ne. 0) then
  print *, 'devinfo : ', istat
end if

hi = a
print *, 'Inverse of A'
do i = 1, n
  write(6,'(8(1x,f9.4))') hi(i,:)
end do
print *, 'A times inverse A'
hh = matmul(ha, hi)
do i = 1, n
  write(6,'(8(1x,f9.4))') hh(i,:)
end do

end program

Thank you for the response.

The test program works fine, even after changing the data type. However, I am struggling to understand the difference between this program and what I am doing in my own program.

The only difference that comes to my mind is the dimension of the matrix involved. The case I am studying is about n=500.

When I tried a smaller matrix (n=9), cusolverDnXtrtri did not report any error, but the result is incorrect: a diagonalization step later in the code fails. If I stick to the LAPACK Dtrtri routine, the code works fine, so the issue might be in cuSOLVER’s trtri. Unfortunately, without being able to see the code inside cuSOLVER, it is quite difficult to pin down the problem.

Do you have any suggestions or ideas on how to proceed?

N = 500 is not huge. If that matrix is coming from your main application, I would store it off to a data file. Then write a test using CUDA Fortran managed data: read A from the data file, call cusolverDnXtrtri to produce the GPU result, and call LAPACK dtrtri to produce the CPU result. If they do not match to roundoff tolerance, send me the code and data and I will file a bug, or at least get some input from the cuSOLVER team.
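A minimal harness along these lines might look like the sketch below. It assumes the matrix was dumped unformatted to a file named A.dat, that a LAPACK dtrtri is linked in, and that the factorized matrix is upper triangular and non-unit, matching the calls above; the file name and n are illustrative:

program compare_trtri
! Sketch: compare cusolverDnXtrtri (GPU) against LAPACK dtrtri (CPU).
! Assumes A was written to 'A.dat' as an unformatted n-by-n real(8) array.
use cudafor
use cusolverDn
implicit none
integer, parameter :: n = 500
type(cusolverDnHandle) :: handle
real(8), managed :: A(n,n)        ! managed: visible to both CPU and GPU
real(8) :: Aref(n,n)
integer(8) :: dlwork, hlwork
integer(1), allocatable, device :: dwork(:)
integer(1), allocatable :: hwork(:)
integer(4), device :: devinfo_d
integer :: istat, info

open(10, file='A.dat', form='unformatted'); read(10) A; close(10)
Aref = A                          ! CPU copy for the reference result

istat = cusolverDnCreate(handle)
istat = cusolverDnXtrtri_bufferSize(handle, CUBLAS_FILL_MODE_UPPER, &
        CUBLAS_DIAG_NON_UNIT, n, cudaDataType(CUDA_R_64F), A, n, dlwork, hlwork)
allocate(dwork(max(dlwork,1_8)), hwork(max(hlwork,1_8)))
istat = cusolverDnXtrtri(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_DIAG_NON_UNIT, &
        n, cudaDataType(CUDA_R_64F), A, n, dwork, dlwork, hwork, hlwork, devinfo_d)
istat = cudaDeviceSynchronize()   ! make sure the GPU result is complete

call dtrtri('U', 'N', n, Aref, n, info)   ! CPU reference

print *, 'max abs difference: ', maxval(abs(A - Aref))
istat = cusolverDnDestroy(handle)
end program

If the printed difference is larger than roundoff for your data, that program plus the data file is exactly the reproducer to send in.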

Hi, I did a series of tests, and when I use the cusolverDnXtrtri function in an external mini-program the result is correct in every test I ran.

However, in my code, for a specific case (n=496), I still get an error of ISTAT = 7 (even though the 7th element of the diagonal is not 0).

The problem arises later, when I try to copy devInfo or the array back to the host. I get something like this:
0: copyout Memcpy (host=0x7fffebdc663c, dev=0x7fa587e00000, size=4) FAILED: 700(an illegal memory access was encountered)

Using a smaller matrix (n=248) or even a bigger one (n=744, n=992) is totally fine.

A very strange behaviour is that if I run my program on a different GPU (a T4), it magically works, even for n=248.
[The previous setup was a V100S]

Update:
I managed to reproduce the error in the “external mini-program”; however, it appears randomly (not every time I run it).

I also managed to find a workaround for the problem in my main application by increasing the workspace allocation on the device by a factor of 4. This is something another developer did for a similar problem with MKL…

Is it possible that the error is caused by a memory pointer going out of bounds, so that when the workspace is enlarged the bounds are simply never reached?
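For reference, the workaround amounts to padding the device workspace returned by the bufferSize query before allocating; a sketch, reusing the variable names from my snippet above (the factor of 4 is purely empirical, not documented — this is a band-aid, not a fix):

ISTAT = CUSOLVERDNXTRTRI_BUFFERSIZE(HANDLE, CUBLAS_FILL_MODE_UPPER, &
       &CUBLAS_DIAG_NON_UNIT, N, CUDADATATYPE(CUDA_R_64F), C, N, DLWORK, HLWORK)
DLWORK = 4*MAX(DLWORK, 1_8)          ! empirical 4x padding on the device side
ALLOCATE(DWORK(DLWORK), STAT=ISTAT)
ALLOCATE(HWORK(MAX(HLWORK, 1_8)), STAT=ISTAT)
! pass the padded DLWORK as workspaceInBytesOnDevice in the Xtrtri call

With this padding the illegal-memory-access error no longer appears in my runs, which is what makes me suspect an out-of-bounds access inside the routine.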

I leave attached the files I used for the tests.

S_k.txt (4.7 MB)
External_TEST.f90.txt (5.2 KB)
EXPECTED_OUTPUT.txt (3.8 MB)