[OpenACC Fortran] Linear algebra in kernel loop

I am trying to run linear algebra algorithms, e.g., matrix inversion, inside an OpenACC kernel loop (n=6, ntotal=1000):

real*8  a(n, n, ntotal), b(n,n), c(n)

!$acc region
!$acc loop kernel independent  private(b, c)
do i = 1, ntotal
   ...
   ! get the inverse of matrix a(:, :, i) and store it in a(:, :, i),
   ! where b and c are necessary auxiliary local arrays.
   ...
enddo
!$acc end region

However, when ntotal is large, e.g., ntotal > 200, I get an error message like the one below:

call to cuMemFree returned error 700: Launch failed

I think that the compiler allocates memory for the “private” b and c arrays with sizes like n*n*ntotal and n*ntotal to make them private enough. But this takes up too much memory.

Is anyone else working on matrix linear algebra that requires local matrices? I guess this is a common issue when it is handled naively, as I did. I am keen to know how to get around it.

Any comments are very welcome!

Hi!

Do you really want to implement it with OpenACC?
Use the cuBLAS library.

Alexey

Alexey. He said he is learning! Let him get an answer!!

Malcolm

Hi e3lb89cz,

I think that the compiler specifies memory for the “private” b and c arrays like n*n*ntotal, and n*ntotal to make them private enough.

Each thread will get its own copy of b and c, so the number of copies equals the number of threads, which may or may not be equal to ntotal.

But this takes up too much memory

It’s definitely possible to run out of memory by having too much private data, but I’m not convinced that’s what’s happening here. At N=6, NTOTAL=200, and assuming the number of threads is 256, your data usage is very small.
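To put rough numbers on that, here is a minimal sketch that adds up the private storage for b(n,n) and c(n) from the original post. The thread count of 256 is the assumption stated above, not a measured value, and the program name and variable names are made up for illustration.

```fortran
program private_footprint
  implicit none
  ! n and ntotal come from the original post; nthreads = 256 is an assumption.
  integer, parameter :: n = 6, ntotal = 200, nthreads = 256
  integer, parameter :: bytes_per_real8 = 8
  integer :: per_copy, per_thread_total, per_iter_total

  ! One private copy holds b(n,n) and c(n): 36 + 6 = 42 real*8 values.
  per_copy = (n*n + n) * bytes_per_real8

  ! One copy per thread (what privatization actually needs).
  per_thread_total = per_copy * nthreads

  ! One copy per loop iteration (the n*n*ntotal + n*ntotal guess).
  per_iter_total = per_copy * ntotal

  print '(a,i0)', 'bytes per private copy: ', per_copy
  print '(a,i0)', 'bytes for one copy per thread: ', per_thread_total
  print '(a,i0)', 'bytes for one copy per iteration: ', per_iter_total
end program private_footprint
```

Under these assumptions the program reports 336 bytes per copy, 86016 bytes for 256 threads, and 67200 bytes for 200 iterations, so either way the private data is well under 100 KB.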

call to cuMemFree returned error 700: Launch failed

This typically means that the kernel that was launched before this call to cuMemFree crashed for some reason. Why? I’d need a reproducing example to find out. The first thing to do, though, is to make sure the host version is correct and that you’re not hitting any out-of-bounds errors (add the -Mbounds flag to check) or other array access issues.

  • Mat

I have distilled the code down to a version that only copies memory. Please take a look and try to compile and run it. You may get the same error message.

call to cuMemFree returned error 700: Launch failed

I checked the host version, and there should be no out-of-bounds errors. If you happen to know why this occurs, please let me know. I am keen to overcome this issue. Thanks a lot in advance!

program inversematrix

  implicit   real*8 (a-h,o-z)

  real*8  a(6,6,10000)
  real*8  c(6,6), L(6,6), U(6,6), b(6), d(6), x(6)

  niter = 10000
  n = 6

  a = 0.0d0
  do ie = 1, niter
  do i = 1, n
  a(i,i,ie) = 1.0d0
  enddo
  enddo

!$acc data region 
!$acc region
!$acc loop kernel independent private(c,L,U,b,d,x) 
  do ie = 1, niter

  c(:,:)=a(:,:,ie)
  L=c
  U=L
  b(:) = U(:,1)
  d=b
  x=d
  a(:,:,ie)=L(:,:)

  enddo
!$acc end region  
!$acc end data region

end program inversematrix

Hi e3lb89cz,

I’m still not convinced that it’s just a memory limit issue, but I don’t see the exact cause myself, so I will pass this on to engineering (logged as TPR#19484).

As a workaround, you can limit the privatization to each gang by putting the private clause on the parallel construct, though this limits the amount of parallelization.

% cat test2.f90 
program inversematrix

  implicit   real*8 (a-h,o-z)

  real*8  a(6,6,10000)
  real*8 ::  c(6,6), L(6,6), U(6,6), b(6), d(6), x(6)

  niter = 10000
  n = 6

  a = 0.0d0
  do ie = 1, niter
  do i = 1, n
  a(i,i,ie) = 1.0d0
  enddo
  enddo

!$acc parallel private(c,L,U,b,d,x) 
!$acc loop
  do ie = 1, niter

  c(:,:)=a(:,:,ie)
  L=c
  U=L
  b(:) = U(:,1)
  d=b
  x=d
  a(:,:,ie)=L(:,:)

  enddo
!$acc end parallel

  print *, a(1,1,1)
end program inversematrix
% pgf90 -acc -Minfo test2.f90 -ta=nvidia ; a.out
inversematrix:
     11, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     18, Accelerator kernel generated
         20, !$acc loop gang ! blockidx%x
         22, !$acc loop vector(256) ! threadidx%x
         25, !$acc loop vector(256) ! threadidx%x
         28, !$acc loop vector(256) ! threadidx%x
     18, Generating present_or_copy(a(:,:,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     22, Loop is parallelizable
     25, Loop is parallelizable
     28, Loop is parallelizable
    1.000000000000000
  • Mat
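For comparison with the gang-level workaround above, here is a sketch of the same reproducer with the private clause on the loop construct, written in standard OpenACC syntax rather than the region directives of the original post. The intent is that each executing thread, not each gang, gets its own copies of the scratch arrays. The gang vector schedule and the copy(a) clause are assumptions chosen for illustration; this has not been checked against the compiler release discussed in this thread, and how the inner array assignments are scheduled is up to the compiler.

```fortran
program inversematrix_loop_private
  implicit none
  integer, parameter :: n = 6, niter = 10000
  real(8) :: a(n,n,niter)
  real(8) :: c(n,n), L(n,n), U(n,n), b(n), d(n), x(n)
  integer :: i, ie

  ! Fill every 6x6 slice with the identity matrix.
  a = 0.0d0
  do ie = 1, niter
    do i = 1, n
      a(i,i,ie) = 1.0d0
    enddo
  enddo

  ! private on the loop (not on the parallel construct):
  ! scratch arrays are privatized per loop worker instead of per gang.
  !$acc parallel copy(a)
  !$acc loop gang vector private(c,L,U,b,d,x)
  do ie = 1, niter
    c(:,:) = a(:,:,ie)
    L = c
    U = L
    b(:) = U(:,1)
    d = b
    x = d
    a(:,:,ie) = L(:,:)
  enddo
  !$acc end parallel

  ! a is only copied through the scratch arrays, so it should be unchanged.
  print *, a(1,1,1)
end program inversematrix_loop_private
```

Without OpenACC enabled, the directives are treated as comments and the program runs serially, which gives a quick host-side check that a(1,1,1) is still 1.0 before trying it on the device.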

The current 13.10 release corrects this reported problem.

thanks,
dave