Using cusparseDgtsv2_nopivot() with OpenACC in Fortran code

Hello,
I have written a test code that calls the cusparseDgtsv2_nopivot function in the cuSPARSE library to solve a tridiagonal system, using OpenACC for the data transfers, mainly following tcusparse3.f90.
Here is my program:

PROGRAM TDMA
  use openacc
  use cusparse
  implicit none

  integer, parameter :: npts = 31
  integer :: cusparseCreate_status
  type(cusparseHandle) :: handle
  integer :: m, n, ldb
  real(8) :: dl(npts), d(npts), du(npts)
  real(8) :: B(npts)
  integer :: i
  integer :: istat
  integer(8) :: bufferSizeInBytes
  integer(1), pointer:: buffer(:)
  
  cusparseCreate_status = cusparseCreate(handle)
  !$acc data create(dl,d,du,B)
  m = npts
  n = 1
  ldb = npts
  dl = 1.0
  dl(1) = 0.0
  d = 2.0
  du = 1.0
  du(npts) = 0.0
  do i = 1, 16
    B(i) = i
    B(32 - i) = i
  end do
  !%acc update device(dl,d,du,B)
    
  print *, 'CREATE cusparseCreate_status: '
  if (cusparseCreate_status == CUSPARSE_STATUS_SUCCESS) then
    print *, 'CUSPARSE_STATUS_SUCCESS'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_NOT_INITIALIZED) then
    print *, 'CUSPARSE_STATUS_NOT_INITIALIZED'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_ALLOC_FAILED) then
    print *, 'CUSPARSE_STATUS_ALLOC_FAILED'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_ARCH_MISMATCH) then
    print *, 'CUSPARSE_STATUS_ARCH_MISMATCH'
  end if

  istat = cusparseDgtsv2_nopivot_bufferSizeExt(handle, m, n, dl, d, du, B, ldb, bufferSizeInBytes)
  allocate(buffer(bufferSizeInBytes))
  istat = cusparseDgtsv2_nopivot(handle, m, n, dl, d, du, B, ldb, buffer)
  print *, 'Dgtsv STATUS: '
  if (istat == CUSPARSE_STATUS_SUCCESS) then
    print *, 'CUSPARSE_STATUS_SUCCESS'
  elseif (istat == CUSPARSE_STATUS_NOT_INITIALIZED) then
    print *, 'CUSPARSE_STATUS_NOT_INITIALIZED'
  elseif (istat == CUSPARSE_STATUS_ALLOC_FAILED) then
    print *, 'CUSPARSE_STATUS_ALLOC_FAILED'
  elseif (istat == CUSPARSE_STATUS_INVALID_VALUE) then
    print *, 'CUSPARSE_STATUS_INVALID_VALUE'
  elseif (istat == CUSPARSE_STATUS_ARCH_MISMATCH) then
    print *, 'CUSPARSE_STATUS_ARCH_MISMATCH'
  elseif (istat == CUSPARSE_STATUS_EXECUTION_FAILED) then
    print *, 'CUSPARSE_STATUS_EXECUTION_FAILED'
  elseif (istat == CUSPARSE_STATUS_INTERNAL_ERROR) then
    print *, 'CUSPARSE_STATUS_INTERNAL_ERROR'
  end if
  
  !$acc update host(dl,d,du,B)
  !$acc end data
  
  print *, 'The solution is: '
  do i = 1, npts
    print *, 'SOL(', i, '):', B(i)
  end do
END PROGRAM TDMA

After running, it displays:

nvfortran -Mpreprocess -fast -acc=gpu -cudalib=cusparse -o gtsv2acc.exe gtsv2acc.f90
./gtsv2acc.exe
 CREATE cusparseCreate_status:
 CUSPARSE_STATUS_SUCCESS
 Dgtsv STATUS:
 CUSPARSE_STATUS_SUCCESS
Failing in Thread:1
Accelerator Fatal Error: call to cuMemcpyDtoHAsync returned error 700: Illegal address during kernel execution
 File: /home/lixinyu/5555/5555/gtsv2acc.f90
 Function: tdma:1
 Line: 64

make: *** [makefile:15: run] Error 1

The output suggests that the cusparseDgtsv2_nopivot call itself succeeded, but I am still not sure how to transfer its result back to the host and print it.
Any other recommendations for the code would also be helpful.
Looking forward to your replies, thanks!

I believe the problem here is that “buffer” needs to be a device array: the gtsv2 work buffer must live in device memory, so passing a host-allocated array makes the solver kernel dereference an invalid address, which is what the error 700 above is reporting.

To fix, create a device copy of buffer:

  allocate(buffer(bufferSizeInBytes))
!$acc data create(buffer)
  istat = cusparseDgtsv2_nopivot(handle, m, n, dl, d, du, B, ldb, buffer)
!$acc end data
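With this version a host copy of buffer still exists (from the allocate); the data region just creates the device copy that the solver actually uses.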

Or you can add the CUDA Fortran “device” attribute to the declaration of “buffer” and then add “-cuda” to your compilation flags. I prefer this method since there is no need to have a host copy of the array. For example:

% grep "device " tdma.f90
  integer(1), pointer, device :: buffer(:)
% nvfortran -acc -cudalib=cusparse -fast tdma.f90 -cuda ; a.out
 CREATE cusparseCreate_status:
 CUSPARSE_STATUS_SUCCESS
 Dgtsv STATUS:
 CUSPARSE_STATUS_SUCCESS
 The solution is:
 SOL(            1 ):    0.000000000000000
 SOL(            2 ):    1.000000000000000
 SOL(            3 ):    0.000000000000000
 SOL(            4 ):    2.000000000000000
 SOL(            5 ):    0.000000000000000
 SOL(            6 ):    3.000000000000000
 SOL(            7 ):    0.000000000000000
 SOL(            8 ):    4.000000000000000
 SOL(            9 ):    0.000000000000000
 SOL(           10 ):    5.000000000000000
 SOL(           11 ):    0.000000000000000
 SOL(           12 ):    5.999999999999999
 SOL(           13 ):    0.000000000000000
 SOL(           14 ):    7.000000000000000
 SOL(           15 ):    0.000000000000000
 SOL(           16 ):    8.000000000000000
 SOL(           17 ):    0.000000000000000
 SOL(           18 ):    7.000000000000000
 SOL(           19 ):    0.000000000000000
 SOL(           20 ):    5.999999999999999
 SOL(           21 ):    0.000000000000000
 SOL(           22 ):    5.000000000000000
 SOL(           23 ):    0.000000000000000
 SOL(           24 ):    4.000000000000000
 SOL(           25 ):    0.000000000000000
 SOL(           26 ):    3.000000000000000
 SOL(           27 ):    0.000000000000000
 SOL(           28 ):    2.000000000000000
 SOL(           29 ):    0.000000000000000
 SOL(           30 ):    1.000000000000000
 SOL(           31 ):    0.000000000000000
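Note that bufferSizeInBytes itself stays an ordinary host integer, since the bufferSizeExt query returns the size to the host; only buffer needs the device attribute.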

Hope this helps,
Mat

Hi Mat! @MatColgrove
Thanks for your reply. I have corrected my program in both of the ways you mentioned. It now runs, but it does not produce the correct result, only “NaN”.

nvfortran -Mpreprocess -fast -acc=gpu -cudalib=cusparse -cuda -o gtsv2acc.exe gtsv2acc.f90
./gtsv2acc.exe
 CREATE cusparseCreate_status:
 CUSPARSE_STATUS_SUCCESS
 Dgtsv STATUS:
 CUSPARSE_STATUS_SUCCESS
 The solution is:
 SOL(            1 ):                       NaN
 SOL(            2 ):                       NaN
 SOL(            3 ):                       NaN
 SOL(            4 ):                       NaN
 SOL(            5 ):                       NaN
 SOL(            6 ):                       NaN
 SOL(            7 ):                       NaN
 SOL(            8 ):                       NaN
 SOL(            9 ):                       NaN
 SOL(           10 ):                       NaN
 SOL(           11 ):                       NaN
 SOL(           12 ):                       NaN
 SOL(           13 ):                       NaN
 SOL(           14 ):                       NaN
 SOL(           15 ):                       NaN
 SOL(           16 ):                       NaN
 SOL(           17 ):                       NaN
 SOL(           18 ):                       NaN
 SOL(           19 ):                       NaN
 SOL(           20 ):                       NaN
 SOL(           21 ):                       NaN
 SOL(           22 ):                       NaN
 SOL(           23 ):                       NaN
 SOL(           24 ):                       NaN
 SOL(           25 ):                       NaN
 SOL(           26 ):                       NaN
 SOL(           27 ):                       NaN
 SOL(           28 ):                       NaN
 SOL(           29 ):                       NaN
 SOL(           30 ):                       NaN
 SOL(           31 ):                       NaN

Could you point out the mistake for me? Thank you so much!

Hmm, I went back and tried the code on a variety of devices and compiler versions, but they all give what appear to be valid results, i.e. no NaNs.

What nvfortran version are you using? What device and CUDA driver? What OS?

@MatColgrove Thanks for your reply.
My nvfortran version is 23.3, the CUDA driver version is 525.85.05, and the OS is Ubuntu 20.04.2 LTS. The GPU is a Tesla V100-SXM2-16GB. Thanks.

lixinyu@featurize:~$ nvfortran -V
nvfortran 23.3-0 64-bit target on x86-64 Linux -tp skylake-avx512
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

lixinyu@featurize:~$ nvidia-smi
Sat Nov 11 09:29:04 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |

lixinyu@featurize:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

@MatColgrove Oh, so sorry Mat, I have found my mistake. I incorrectly wrote one ‘!$acc’ directive as ‘!%acc’ (the update device line), so the compiler treated it as a plain comment. When I corrected it, I got the correct result.
How careless of me. I’m sorry to have wasted your time. Thank you so much!
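For anyone hitting the same NaN issue, the whole fix is that one character in the update directive; here is the relevant line from the program above, mistyped and then corrected:

  !%acc update device(dl,d,du,B)   ! mistyped: '!%acc' is an ordinary Fortran comment, so nothing is copied to the device
  !$acc update device(dl,d,du,B)   ! corrected: copies the initialized host arrays to the device before the solve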

No worries, and my apologies: I had fixed this early on but was then focused on the issue with buffer, so I missed letting you know about it.