How to solve a tridiagonal system using the cusparse<t>gtsv2_nopivot() functions in the cuSPARSE library

Hello,
I am a cuSPARSE beginner and want to call functions from the cuSPARSE library to solve a tridiagonal system. Here is a program I wrote with reference to other forum users’ code. The output of the program is not the solution of the system, but the values originally assigned to the B vector. How do I solve this problem?
Thank you very much!

PROGRAM TDMA
	
  use iso_C_binding
  use cudafor
  use cusparse
  implicit none

  ! atri - sub-diagonal (means it is the diagonal below the main diagonal)
  ! btri - the main diagonal
  ! ctri - sup-diagonal (means it is the diagonal above the main diagonal)
  ! dtri - right part
  ! npts - number of equations
 
  integer(c_int), parameter :: npts = 31
  real(c_double), dimension(npts) :: atri, btri, ctri, dtri
  integer(c_int) :: cusparseCreate_status
  type(cusparseHandle) :: handle
  integer(c_int) :: m, n, ldb
  real(c_double), dimension(npts) :: dl, d, du
  real(c_double), dimension(npts, 1) :: B
  integer(c_int) :: i
  integer(c_size_t), target :: bufferSizeInBytes
  integer(c_int) :: istat
  integer(c_int), dimension(:), allocatable :: buffer

  atri = 1.0
  atri(1) = 0.0
  btri = 2.0
  ctri = 1.0
  ctri(npts) = 0.0

  do i = 1, 16          ! symmetric RHS: 1,2,...,16,...,2,1 (matches the expected solution below)
    dtri(i) = i
    dtri(32 - i) = i
  end do

  do i = 1, npts
    B(i, 1) = dtri(i)
  end do

  dl  = atri
  d   = btri
  du  = ctri
  m   = npts
  n   = 1
  ldb = 1

  cusparseCreate_status = cusparseCreate(handle)

  print *, 'CREATE cusparseCreate_status: '
  if (cusparseCreate_status == CUSPARSE_STATUS_SUCCESS) then
    print *, 'CUSPARSE_STATUS_SUCCESS'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_NOT_INITIALIZED) then
    print *, 'CUSPARSE_STATUS_NOT_INITIALIZED'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_ALLOC_FAILED) then
    print *, 'CUSPARSE_STATUS_ALLOC_FAILED'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_ARCH_MISMATCH) then
    print *, 'CUSPARSE_STATUS_ARCH_MISMATCH'
  end if

  istat = cusparseDgtsv2_nopivot_bufferSizeExt(handle, m, n, dl, d, du, B, ldb, bufferSizeInBytes)
  allocate(buffer(bufferSizeInBytes / c_sizeof(buffer(1))))

  istat = cusparseDgtsv2_nopivot(handle, m, n, dl, d, du, B, ldb, buffer)

  print *, 'Dgtsv STATUS: '
  if (istat == CUSPARSE_STATUS_SUCCESS) then
    print *, 'CUSPARSE_STATUS_SUCCESS'
  elseif (istat == CUSPARSE_STATUS_NOT_INITIALIZED) then
    print *, 'CUSPARSE_STATUS_NOT_INITIALIZED'
  elseif (istat == CUSPARSE_STATUS_ALLOC_FAILED) then
    print *, 'CUSPARSE_STATUS_ALLOC_FAILED'
  elseif (istat == CUSPARSE_STATUS_INVALID_VALUE) then
    print *, 'CUSPARSE_STATUS_INVALID_VALUE'
  elseif (istat == CUSPARSE_STATUS_ARCH_MISMATCH) then
    print *, 'CUSPARSE_STATUS_ARCH_MISMATCH'
  elseif (istat == CUSPARSE_STATUS_EXECUTION_FAILED) then
    print *, 'CUSPARSE_STATUS_EXECUTION_FAILED'
  elseif (istat == CUSPARSE_STATUS_INTERNAL_ERROR) then
    print *, 'CUSPARSE_STATUS_INTERNAL_ERROR'
  end if

  do i = 1, npts
    dtri(i) = B(i, 1)
  end do

  ! Printing solution
  print *, 'The solution is: '
  ! Expected Solution = transpose of (0,1,0,2,0,3,...,0,8,0,7,0,6,...,0,1,0)
  do i = 1, npts
    print *, 'SOL(', i, '):', dtri(i)
  end do
END PROGRAM TDMA

Hello @lxyonline2887 and welcome to the NVIDIA developer forums!

I took the liberty of moving your post to the CUDA Programming category, I am sure you will find some support there.

Thanks!

@MarkusHoHo Why would this sub-forum be more suitable than the sub-forum dedicated to accelerated libraries including cuSPARSE?

Thank you so much!

This may be of interest.

I haven’t studied your code carefully, but it seems fairly evident that you are unaware of the distinction between host and device memory when using a GPU. The cusparse function calls generally require the data to already be resident in device memory, which you can learn by reading the documentation. Furthermore, when starting with host data, it’s not sufficient simply to use device allocations: there is usually some copying of data from host to device, and copying of results from device to host, in CUDA and with typical CUDA library usage, including cusparse.

In Fortran/cudafor, such a device allocation would be evidenced by the device attribute on the declaration. That attribute doesn’t appear anywhere in your code, so all of your data is in host memory. That won’t work, and it is not how cusparse is intended to be used. You can find other examples of using CUDA library routines in Fortran here on these forums that include proper use of the device attribute for the necessary allocations; here is one example. (Yes, I am aware that that is not an example of cusparse gtsv2.) I don’t have an example of the use of gtsv2 in CUDA Fortran to point you to.
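
The general pattern, though, would look something like the untested sketch below. It assumes the cusparse module that ships with the NVIDIA HPC SDK, and that the file is compiled as CUDA Fortran (e.g. nvfortran -cuda -cudalib=cusparse, or use a .cuf file extension); check the interface declarations in the documentation before relying on it. Note that in CUDA Fortran a plain assignment between a host array and a device array performs the copy:

program gtsv2_sketch
  use cudafor
  use cusparse
  implicit none

  integer, parameter :: npts = 31
  real(8), dimension(npts) :: dl, d, du, B                  ! host arrays
  real(8), device, dimension(npts) :: dl_d, d_d, du_d, B_d  ! device arrays
  integer(1), device, allocatable :: buffer_d(:)            ! gtsv2 device workspace
  integer(8) :: bufferSize
  type(cusparseHandle) :: handle
  integer :: istat

  ! Build the tridiagonal system on the host
  dl = 1.0d0; dl(1) = 0.0d0
  d  = 2.0d0
  du = 1.0d0; du(npts) = 0.0d0
  B  = 1.0d0

  istat = cusparseCreate(handle)

  ! Host-to-device copies (assignment performs the transfer)
  dl_d = dl; d_d = d; du_d = du; B_d = B

  ! Query the workspace size, allocate it on the device, then solve.
  ! Note that ldb must be >= m, so pass npts here, not 1.
  istat = cusparseDgtsv2_nopivot_bufferSizeExt(handle, npts, 1, &
                                               dl_d, d_d, du_d, B_d, npts, bufferSize)
  allocate(buffer_d(bufferSize))
  istat = cusparseDgtsv2_nopivot(handle, npts, 1, &
                                 dl_d, d_d, du_d, B_d, npts, buffer_d)

  ! Device-to-host copy of the result (gtsv2 overwrites B in place)
  B = B_d
  print *, B

  deallocate(buffer_d)
  istat = cusparseDestroy(handle)
end program gtsv2_sketch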

Thanks for your replies! I’ve figured out what my problem was. I will try to read those examples and correct my program.

Hi Robert!
I did some reading about host and device arrays, and I modified my program according to the example you provided. Here is my corrected program:

program TDMA
  use cudafor
  use cusparse
  implicit none

  integer, parameter :: npts = 31
  integer :: m=31
  integer :: n=1
  integer :: ldb=31
  integer :: m_d
  integer :: n_d
  integer :: ldb_d
  real(8), dimension(:),  allocatable :: dl, d, du
  real(8), allocatable, device, dimension(:) :: dl_d
  real(8), allocatable, device, dimension(:) :: d_d
  real(8), allocatable, device, dimension(:) :: du_d
  real(8), dimension(:), allocatable :: B
  real(8), allocatable,  device, dimension(:) :: B_d
  integer(8) :: i
  integer(8) :: istat
  integer(8) :: bufferSize
  integer(1), pointer, device :: buffer(:)

  integer :: cusparseCreate_status
  type(cusparseHandle) :: handle

  istat = cudaMalloc(dl_d, n * c_sizeof(1))
  istat = cudaMalloc(d_d, n * c_sizeof(1))
  istat = cudaMalloc(du_d, n * c_sizeof(1))
  istat = cudaMalloc(B_d, n * c_sizeof(1))
  
  ! Initialize your arrays and values
  ! Allocate host arrays
  allocate(dl(npts))
  allocate(d(npts))
  allocate(du(npts))
  allocate(B(npts))
  
  dl = 1.0
  dl(1) = 0.0
  d = 2.0
  du = 1.0
  du(npts) = 0.0

  do i = 1, 16
    B(i) = i
    B(32 - i) = i
  end do
  
  istat = cudaMemcpy(dl_d, dl, n * c_sizeof(1), cudaMemcpyHostToDevice)
  istat = cudaMemcpy(d_d, d, n * c_sizeof(1), cudaMemcpyHostToDevice)
  istat = cudaMemcpy(du_d, du, n * c_sizeof(1), cudaMemcpyHostToDevice)
  istat = cudaMemcpy(B_d, B, n * c_sizeof(1), cudaMemcpyHostToDevice)
  
  ! cusparse_create and check
  cusparseCreate_status = cusparseCreate(handle)

  print *, 'CREATE cusparseCreate_status: '
  if (cusparseCreate_status == CUSPARSE_STATUS_SUCCESS) then
    print *, 'CUSPARSE_STATUS_SUCCESS'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_NOT_INITIALIZED) then
    print *, 'CUSPARSE_STATUS_NOT_INITIALIZED'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_ALLOC_FAILED) then
    print *, 'CUSPARSE_STATUS_ALLOC_FAILED'
  elseif (cusparseCreate_status == CUSPARSE_STATUS_ARCH_MISMATCH) then
    print *, 'CUSPARSE_STATUS_ARCH_MISMATCH'
  end if

  ! Calling Dgtsv and checking the output Status
  istat = cusparseDgtsv2_nopivot_bufferSizeExt(handle, m_d, n_d, dl_d, d_d, du_d, B_d, ldb_d, bufferSize)
  allocate(buffer(bufferSize / c_sizeof(buffer(1))))

  istat = cusparseDgtsv2_nopivot(handle, m_d, n_d, dl_d, d_d, du_d, B_d, ldb_d, buffer)
  
  write(*,*) m_d, n_d, dl_d, d_d, du_d, B_d, ldb_d
  print *, 'Dgtsv STATUS: '
  write(*,*) istat
  
  if (istat == CUSPARSE_STATUS_SUCCESS) then
    print *, 'CUSPARSE_STATUS_SUCCESS'
    
  elseif (istat == CUSPARSE_STATUS_NOT_INITIALIZED) then
    print *, 'CUSPARSE_STATUS_NOT_INITIALIZED'
  elseif (istat == CUSPARSE_STATUS_ALLOC_FAILED) then
    print *, 'CUSPARSE_STATUS_ALLOC_FAILED'
  elseif (istat == CUSPARSE_STATUS_INVALID_VALUE) then
    print *, 'CUSPARSE_STATUS_INVALID_VALUE'
  elseif (istat == CUSPARSE_STATUS_ARCH_MISMATCH) then
    print *, 'CUSPARSE_STATUS_ARCH_MISMATCH'
  elseif (istat == CUSPARSE_STATUS_EXECUTION_FAILED) then
    print *, 'CUSPARSE_STATUS_EXECUTION_FAILED'
  elseif (istat == CUSPARSE_STATUS_INTERNAL_ERROR) then
    print *, 'CUSPARSE_STATUS_INTERNAL_ERROR'
  end if

  istat = cudaMemcpy(B, B_d, n * c_sizeof(1), cudaMemcpyDeviceToHost)

  print *, 'The solution is: '
  do i = 1, npts
    print *, 'SOL(', i, '):', B(i)
  end do
END PROGRAM TDMA

When “device” is present in the code, the compiler reports the following errors:
NVFORTRAN-S-0034-Syntax error at or near device (cuda_tdma.f90: 10)
NVFORTRAN-S-0034-Syntax error at or near device (cuda_tdma.f90: 11)
NVFORTRAN-S-0034-Syntax error at or near device (cuda_tdma.f90: 12)
NVFORTRAN-S-0034-Syntax error at or near device (cuda_tdma.f90: 14)
NVFORTRAN-S-0034-Syntax error at or near device (cuda_tdma.f90: 15)
NVFORTRAN-S-0034-Syntax error at or near device (cuda_tdma.f90: 16)
NVFORTRAN-S-0034-Syntax error at or near device (cuda_tdma.f90: 18)

When I delete “device”, the compiler errors are:
NVFORTRAN-S-0155-Could not resolve generic procedure cudamalloc (cuda_tdma.f90:27)
NVFORTRAN-S-0155-Could not resolve generic procedure cudamalloc (cuda_tdma.f90: 28)
NVFORTRAN-S-0155-Could not resolve generic procedure cudamalloc (cuda_tdma.f90: 29)
NVFORTRAN-S-0155-Could not resolve generic procedure cudamalloc (cuda_tdma.f90: 30)

It’s beyond my understanding QwQ. How can I solve this problem?
Looking forward to your reply. Thank you so much!