A runtime error during use of cusparseDgtsv2 function in CUDA Fortran

Hey! I am encountering a runtime error while trying to run a simple code that uses cusparseDgtsv2. Can someone help me with this?
The code snippet is as follows:

real(dp), device,dimension(NImax) :: a_rd,b_rd,c_rd,r

       real(dp), dimension(NImax) :: a_prev,b_prev,alpha_rd,beta_rd
       real(dp), dimension(NImax) :: c_prev,r_prev

       integer(8) :: freeMem,totalMem
       integer(4) :: ierr,version;
       integer,device :: NImax2;
       character, allocatable, device :: buf(:)
       integer (c_size_t) :: N
       type(cusparseHandle) :: handle
       ierr = cusparseCreate(handle)
       ierr = cusparseGetVersion(handle, version)
       ierr = cusparseCreate(handle)


       ierr = cudaMemGetInfo(freeMem,totalMem)
        write(*,*) "Total device memory: ",totalMem, "bytes",dp, " ", version
        write(*,*) "Free device memory: ", freeMem, "bytes"
       write(*,*) "used device memory: ", totalMem - freeMem, "bytes"

        do i = 1, NImax
          c_rd(i) = 25
          a_rd(i) = 50
          b_rd(i) = 25
          r(i)    = 250
        end do
          a_rd(1) = 0
          c_rd(NImax) = 0


         do i = 1, NImax
          r_prev(i)    = r(i)
          a_prev(i) = a_rd(i)
        end do

 write(*,*) "Before",r_prev(25)
        ierr = cusparseDgtsv2_bufferSizeExt(handle,NImax,1,c_rd,b_rd,a_rd,r,NImax,N)
        allocate (buf(N))
        write(*,*) "buffer ",N
          ierr = cusparseDgtsv2(handle,NImax,1,c_rd,b_rd,a_rd,r,NImax,c_loc(buf));
        !!$acc end host_data
        !!acc end data


        r_prev = r

        write(*,*) "After",a_prev(25)

The Output is as follows:

Total device memory:                6226378752 bytes            8   
        12001
 Free device memory:                6049955840 bytes
 used device memory:                 176422912 bytes
 Before    250.0000000000000     
 buffer                     30336
0: copyout Memcpy (host=0x202ad10, dev=0x7f42cb1e4800, size=1024) FAILED: 700(an illegal memory access was encountered)

How can I resolve this?

That error message is actually generated from OpenACC, correct?

Please provide a short, complete code that demonstrates the issue, along with the exact compile command you used to compile the code.

@Robert_Crovella I don’t know exactly whether the error message comes from CUDA or OpenACC. I have an in-house solver that is parallelized using OpenACC. I am trying to accelerate it further by using CUDA libraries, which are known to perform better, especially for tridiagonal systems. Since my code has many files to be linked, I use a Makefile to compile it. Here is a rough snippet of the Makefile I use.

FC=nvfortran -pg -acc  -cuda  -Minfo=accel  -cudalib=cusparse
#OP_LINK = -w -g -mcmodel=large
#FLAGS=-DNOT_ALLPERIODIC
run: main.o grid.o initial.o tdma.o Module.o print.o
        $(FC)  -o run main.o grid.o initial.o tdma.o  Module.o

main.o: main.F90 Module.o
        $(FC) -c main.F90
grid.o: grid.F90 Module.o
        $(FC) -c grid.F90
initial.o: initial.F90 Module.o
        $(FC) -c initial.F90
tdma.o: tdma.cuf Module.o
        $(FC) -c tdma.cuf
Module.o: Module.F90
        $(FC) -c Module.F90
clean:
        rm -f *.o run *.mod

As far as the code is concerned, I am sharing the TDMA subroutine, which is called from main.F90.

SUBROUTINE TDMA(A,B,C)
       use declare_variables
       use cudafor
       use cusparse
       use, intrinsic :: iso_c_binding
       implicit none

       integer blocks, C
       integer :: p_rd,s,e
       real(dp) :: maxi,bufferSize



       real(dp),dimension(NImax,NJmax,NKmax,nblocks,nvars) :: A, B
       real(dp),dimension(NImax) :: F,xout
       real(dp) :: b4,a2
       real(dp),dimension(NImax) :: AMlocal,APlocal,AClocal

       real(dp), device,dimension(NImax) :: a_rd,b_rd,c_rd,r

       real(dp), dimension(NImax) :: a_prev,b_prev,alpha_rd,beta_rd
       real(dp), dimension(NImax) :: c_prev,r_prev

       integer(8) :: freeMem,totalMem
       integer(4) :: ierr,version;
       integer,device :: NImax2;
       character, allocatable, device :: buf(:)
       integer (c_size_t) :: N
       type(cusparseHandle) :: handle
       ierr = cusparseCreate(handle)
       ierr = cusparseGetVersion(handle, version)
       ierr = cusparseCreate(handle)


       ierr = cudaMemGetInfo(freeMem,totalMem)
        write(*,*) "Total device memory: ",totalMem, "bytes",dp, " ", version
        write(*,*) "Free device memory: ", freeMem, "bytes"
       write(*,*) "used device memory: ", totalMem - freeMem, "bytes"

        do i = 1, NImax
          c_rd(i) = 25
          a_rd(i) = 50
          b_rd(i) = 25
          r(i)    = 250
        end do
          a_rd(1) = 0
          c_rd(NImax) = 0


         do i = 1, NImax
          r_prev(i)    = r(i)
         a_prev(i) = a_rd(i)
        end do
       
         write(*,*) "Before",r_prev(25)
        ierr = cusparseDgtsv2_bufferSizeExt(handle,NImax,1,c_rd,b_rd,a_rd,r,NImax,N)
        allocate (buf(N))
        write(*,*) "buffer ",N
          ierr = cusparseDgtsv2(handle,NImax,1,c_rd,b_rd,a_rd,r,NImax,c_loc(buf));
        !!$acc end host_data
        !!acc end data


        r_prev = r

        write(*,*) "After",a_prev(25)

        do blocks=1,C
         do k=1, nkmax
!$acc parallel loop private(** some variables **) &
!$acc present(** some variables)
          do j=1,njmax
                do i=1,nimax

                !Original TDMA algorithm to calculate the derivatives
                ! Parameter A and B passed to subroutine are used here

          enddo
         enddo
!$acc end parallel
        enddo
       enddo

I hope you understand the code flow. I cannot share the actual code.
One thing to mention is that, in the above subroutine, I am only trying to check whether cusparseDgtsv2 actually works. Once it works, I plan to modify the structure of the subroutine to suit my purpose.

Thank you in advance!

I’m not asking for your actual code. If you want to share a short, complete test case (one file, complete), along with the exact compile command used to compile it, I will take a look as time permits.

I took an initial look at your first posting and did not spot any obvious issues. Based on your additional comments, it’s clear to me that you are mixing in OpenACC. I’m going to move this thread in case someone else can spot something.

@Robert_Crovella Thanks for your support. Here is a short program that can be copied directly to reproduce the issue.

PROGRAM TDMA_CUDA


        use cudafor
        use cusparse
        use, intrinsic :: iso_c_binding
        implicit none

        real(8), device,dimension(128) :: a_rd,b_rd,c_rd,r
        real(8), dimension(128) :: a_prev,r_prev,b_prev
        INTEGER ::NImax,i


         integer(8) :: freeMem,totalMem
       integer(4) :: ierr,version;
       integer,device :: NImax2;
       character, allocatable, device :: buf(:)
       integer (c_size_t) :: N
       type(cusparseHandle) :: handle
        NImax =128
       ierr = cusparseCreate(handle)
       ierr = cusparseGetVersion(handle, version)
       ierr = cusparseCreate(handle)

        ierr = cudaMemGetInfo(freeMem,totalMem)
        write(*,*) "Total device memory: ",totalMem, "bytes", " ", version
        write(*,*) "Free device memory: ", freeMem, "bytes"
       write(*,*) "used device memory: ", totalMem - freeMem, "bytes"

        do i = 1, NImax
          c_rd(i) = 25
          a_rd(i) = 50
          b_rd(i) = 25
          r(i)    = 250
        end do
          a_rd(1) = 0
          c_rd(NImax) = 0


         do i = 1, NImax
          r_prev(i)    = r(i)
          a_prev(i) = a_rd(i)
        end do

         write(*,*) "Before",r_prev(25)
        ierr = cusparseDgtsv2_bufferSizeExt(handle,NImax,1,c_rd,b_rd,a_rd,r,NImax,N)
        allocate (buf(N))
        write(*,*) "buffer ",N
          ierr = cusparseDgtsv2(handle,NImax,1,c_rd,b_rd,a_rd,r,NImax,c_loc(buf));
        !!$acc end host_data
        !!acc end data
     
         r_prev = r

        write(*,*) "After",a_prev(25)

        !$acc parallel loop copy(b_rd,b_prev)
        do i = 1, NImax
         b_rd(i) = b_prev(i)
        end do
        !$acc end parallel


        END PROGRAM


And here is the compile command I have been using:

nvfortran -pg -acc   -cuda  -Minfo=accel  -cudalib=cusparse -o run tdma.cuf

Try dropping the c_loc():

ierr = cusparseDgtsv2(handle,NImax,1,c_rd,b_rd,a_rd,r,NImax,buf);

When I do that I get the following output:

# compute-sanitizer ./run
========= COMPUTE-SANITIZER
 Total device memory:               84979089408 bytes         12001
 Free device memory:               83799441408 bytes
 used device memory:                1179648000 bytes
 Before    250.0000000000000
 buffer                     30336
 After    50.00000000000000
========= ERROR SUMMARY: 0 errors
#

@Robert_Crovella, it works. Thank you!
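
For anyone landing here later, the fix can be put together as a minimal sketch. It follows the test program from this thread (same module names and the same cusparseDgtsv2_bufferSizeExt / cusparseDgtsv2 calls), passes the device buffer directly instead of c_loc(buf), creates the handle only once, and checks each returned status. The program name, the variable names (dl, d, du, x, bufBytes) and the matrix values are illustrative assumptions, not taken from the original solver. Note that the cuSPARSE gtsv2 routines take the diagonals in the order lower, main, upper, and expect the first element of the lower diagonal and the last element of the upper diagonal to be zero.

```fortran
program gtsv2_sketch
  use cudafor
  use cusparse
  use, intrinsic :: iso_c_binding
  implicit none

  integer, parameter :: m = 128                 ! system size, as in the thread
  real(8), device :: dl(m), d(m), du(m), x(m)   ! lower, main, upper diagonal; RHS/solution
  real(8) :: x_host(m)                          ! host copy of the solution
  character, allocatable, device :: buf(:)      ! device workspace for gtsv2
  integer(c_size_t) :: bufBytes                 ! workspace size in bytes
  integer :: ierr
  type(cusparseHandle) :: handle

  ! Create the handle once (the thread's code created it twice).
  ierr = cusparseCreate(handle)
  if (ierr /= 0) stop "cusparseCreate failed"   ! 0 is CUSPARSE_STATUS_SUCCESS

  ! Illustrative diagonally dominant system (these values are an assumption).
  dl = 1.0d0
  d  = 4.0d0
  du = 1.0d0
  x  = 6.0d0
  dl(1) = 0.0d0      ! first element of the lower diagonal must be zero
  du(m) = 0.0d0      ! last element of the upper diagonal must be zero

  ! Query the workspace size, then allocate the workspace on the device.
  ierr = cusparseDgtsv2_bufferSizeExt(handle, m, 1, dl, d, du, x, m, bufBytes)
  if (ierr /= 0) stop "cusparseDgtsv2_bufferSizeExt failed"
  allocate(buf(bufBytes))

  ! Pass the device array itself, not c_loc(buf).
  ierr = cusparseDgtsv2(handle, m, 1, dl, d, du, x, m, buf)
  if (ierr /= 0) stop "cusparseDgtsv2 failed"

  ! The solution overwrites the right-hand side; copy it back to the host.
  x_host = x
  write(*,*) "x(1), x(m/2) = ", x_host(1), x_host(m/2)

  deallocate(buf)
  ierr = cusparseDestroy(handle)
end program gtsv2_sketch
```

This should build with the same kind of command used above, for example nvfortran -cuda -cudalib=cusparse -o run tdma.cuf. Running it under compute-sanitizer, as shown earlier in the thread, is a quick way to confirm that no illegal memory access remains. For this particular system the entries far from the boundaries should come out close to 1, since every interior row sums to 6.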