call to cuMemHostUnregister returned error 700: Launch failed

OpenACC Fortran

(Please do not suggest using CUDA Fortran. I am working on a huge legacy code, and OpenACC is the simplest way to port it to the GPU.)

This is a follow-up to the previous problem. I am trying to do a series of matrix inversions in a loop. The issue is that the number of iterations “niter” can only be quite small. If it is set to a value such as 1000, the following message is returned when the program is executed:

call to cuMemHostUnregister returned error 700: Launch failed

You should not try to change

!$acc parallel
!$acc loop private(x6,b6)

to

!$acc parallel private(x6,b6)
!$acc loop

Otherwise, the computed results will be incorrect because of a race condition: with “private” on the parallel construct, each gang gets only one copy of x6 and b6, which is then shared by all the loop iterations running within that gang.

The source code is posted below

program matrixinverse

  implicit   real*8 (a-h,o-z)

  real*8  amatr(6,6,10000), x6(6,6), b6(6)

  niter = 10000
  n = 6

  amatr = 0.0d0
  do ie = 1, niter
    do j = 1, n
      amatr(j,j,ie) = dble(j)
    enddo
  enddo

!$acc parallel
!$acc loop private(x6,b6)
  do ie = 1, niter
  !
  ! LU decomposition by point 
  !
   do k = 1, n
     !
     ! decompose the diagonal terms
     !
     do m = 1, k-1
       !
       amatr(k,k,ie) = amatr(k,k,ie) - amatr(k,m,ie)*amatr(m,k,ie)
       !
     enddo
     !
     adicv = 1.0d0/amatr(k,k,ie)
     !
     ! decompose the non diagonal terms
     !
     do i = k+1, n
       !
       do m = 1, k-1
         !
         amatr(k,i,ie) = amatr(k,i,ie) - amatr(k,m,ie)*amatr(m,i,ie)
         amatr(i,k,ie) = amatr(i,k,ie) - amatr(i,m,ie)*amatr(m,k,ie)
         !
       enddo
       !
       amatr(i,k,ie) = amatr(i,k,ie)*adicv
       !
     enddo
     !
   enddo

   do i = 1, n
     !
     do j = 1, n
       !
       b6(j) = 0.0d0
       !
     enddo
     !
     b6(i) = 1.0d0
     !
     ! LU resolution
     !
     do ii = 1, n
       !
       c = 0.0d0
       !
       do jj = 1, ii-1
         !
         c = c + amatr(ii,jj,ie)*x6(jj,i)
         !
       enddo
       !
       x6(ii,i) = b6(ii) - c
       !
     enddo
     !
     do ii = n, 1, -1
       !
       c = 0.0d0
       !
       do jj = ii+1, n
         !
         c = c + amatr(ii,jj,ie)*x6(jj,i)
         !
       enddo
       !
       x6(ii,i) = (x6(ii,i) - c)/amatr(ii,ii,ie)
       !
     enddo
     !
   enddo
   !
   ! this is to test if the private array returns correct values
   !
   amatr(:,:,ie) = x6(:,:)
   !
  enddo
!$acc end parallel

  do ie = 1, niter
    write(*,*) ie
    do i = 1, n
      write(*,"(6f15.5)") (amatr(i,j,ie), j=1,n)
    enddo
  enddo

end

If “niter” is small enough, the result is printed to the screen for each iteration. Every iteration should produce the same result: the diagonal matrix with entries 1, 1/2, 1/3, 1/4, 1/5, 1/6, as shown below

“No. of iteration”
1.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.50000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.33333 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.25000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.20000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.16667

Hi e3lb89cz,

Most likely this is the same issue that you reported earlier. One thing you can do is manually privatize x6 and b6 by adding an extra dimension. Then put them in a “create” clause so they are allocated on the device but never copied between the host and the GPU. Here’s an example:

program matrixinverse

  implicit   real*8 (a-h,o-z)

  real*8  amatr(6,6,10000), x6(6,6,10000), b6(6,10000)

  niter = 10000
  n = 6

!$acc data create(x6,b6) copyout(amatr)
!$acc kernels
  amatr = 0.0d0
  do ie = 1, niter
    do j = 1, n
      amatr(j,j,ie) = dble(j)
    enddo
  enddo
!$acc end kernels

!$acc parallel  
!$acc loop
  do ie = 1, niter
  !
  ! LU decomposition by point
  !
   do k = 1, n
     !
     ! decompose the diagonal terms
     !
     do m = 1, k-1
       !
       amatr(k,k,ie) = amatr(k,k,ie) - amatr(k,m,ie)*amatr(m,k,ie)
       !
     enddo
     !
     adicv = 1.0d0/amatr(k,k,ie)
     !
     ! decompose the non diagonal terms
     !
     do i = k+1, n
       !
       do m = 1, k-1
         !
         amatr(k,i,ie) = amatr(k,i,ie) - amatr(k,m,ie)*amatr(m,i,ie)
         amatr(i,k,ie) = amatr(i,k,ie) - amatr(i,m,ie)*amatr(m,k,ie)
         !
       enddo
       !
       amatr(i,k,ie) = amatr(i,k,ie)*adicv
       !
     enddo
     !
   enddo

   do i = 1, n
     !
     do j = 1, n
       !
       b6(j,ie) = 0.0d0
       !
     enddo
     !
     b6(i,ie) = 1.0d0
     !
     ! LU resolution
     !
     do ii = 1, n
       !
       c = 0.0d0
       !
       do jj = 1, ii-1
         !
         c = c + amatr(ii,jj,ie)*x6(jj,i,ie)
         !
       enddo
       !
       x6(ii,i,ie) = b6(ii,ie) - c
       !
     enddo
     !
     do ii = n, 1, -1
       !
       c = 0.0d0
       !
       do jj = ii+1, n
         !
         c = c + amatr(ii,jj,ie)*x6(jj,i,ie)
         !
       enddo
       !
       x6(ii,i,ie) = (x6(ii,i,ie) - c)/amatr(ii,ii,ie)
       !
     enddo
     !
   enddo
   !
   ! this is to test if the private array returns correct values
   !
   !
  enddo
!$acc end parallel
!$acc kernels 
   amatr(:,:,:) = x6(:,:,:)
!$acc end kernels
!$acc end data 

  do ie = 1, niter
    write(*,*) ie
    do i = 1, n
      write(*,"(6f15.5)") (amatr(i,j,ie), j=1,n)
    enddo
  enddo

end

Hi Mat,

Thanks very much for your prompt feedback.

As you suggested, adding an extra dimension is indeed one solution to my problem, and one I had initially considered. However, in the real application of my code, the total iteration count is expected to go well beyond the million scale (I am working on Computational Fluid Dynamics), so I think the memory overhead of this strategy would be rather prohibitive. That is why I raised this problem.
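
For reference, here is a rough back-of-the-envelope sketch (my own estimate, assuming 8-byte reals and one x6(6,6) plus one b6(6) slice per iteration) of how the extra storage from the manual privatization scales:

program memestimate

  implicit none

  integer(8), parameter :: words_per_iter = 6*6 + 6   ! one x6(6,6) plus one b6(6) slice
  integer(8), parameter :: sizes(3) = (/ 10000_8, 1000000_8, 10000000_8 /)
  integer :: k

  ! extra bytes = 42 words * 8 bytes * niter
  do k = 1, 3
    write(*,'(a,i9,a,f8.1,a)') ' niter = ', sizes(k), '  adds about ', &
                               dble(words_per_iter*sizes(k)*8_8)/1.0d6, ' MB'
  enddo

end

At 10000 iterations this is only a few megabytes, but at tens of millions of iterations it grows into the gigabyte range, and that is just for these two work arrays.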

Above all, it seems to me that, if my implementation is not at fault, OpenACC Fortran is for some reason unable to handle private arrays in the accelerated region once the iteration count is very large (even though, on the surface, the estimated memory requirement of my problem does not seem large at all).

Please let me know if anything wrong with my understanding. Thanks!

As you suggested, adding an extra dimension is indeed one solution to my problem, and one I had initially considered. However, in the real application of my code, the total iteration count is expected to go well beyond the million scale (I am working on Computational Fluid Dynamics), so I think the memory overhead of this strategy would be rather prohibitive. That is why I raised this problem.

When you use the “private” clause, the entire private array is allocated for each thread (or each gang, depending on where the private clause is applied) before the kernel is launched, so the total memory use can end up the same as manually privatizing the array. Hence, when you go larger, “private” won’t help with memory size unless you use a fixed number of gangs and a fixed vector length, which in turn will limit your parallelization.
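
If you do want to go that route, here’s a rough sketch (my own illustration, not tuned for your code) of what fixing the launch configuration could look like. The values num_gangs(256) and vector_length(1) are arbitrary choices; with “private” on the gang loop they cap the number of x6/b6 copies at 256:

program fixedgangs

  implicit   real*8 (a-h,o-z)

  integer, parameter :: n = 6, niter = 10000
  real*8  amatr(n,n,niter), x6(n,n), b6(n)

  amatr = 0.0d0

!$acc parallel copy(amatr) num_gangs(256) vector_length(1)
!$acc loop gang private(x6,b6)
  do ie = 1, niter
    ! stand-in for the real per-iteration work on the private x6 and b6
    b6(1)   = dble(ie)
    x6(1,1) = b6(1)
    amatr(1,1,ie) = x6(1,1)
  enddo
!$acc end parallel

  write(*,*) amatr(1,1,1), amatr(1,1,niter)

end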

Another solution, especially if n remains small, is to change your arrays to scalars. It’s a bit more coding work, but it allows the use of registers and lowers the total memory used, since the variables have the same lifetime as the thread.
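
For your n = 6 case it could look something like this toy sketch (only the b6 part is shown, and the names b6_1 through b6_6 are just illustrative). Indexed accesses such as b6(ii) become a SELECT CASE:

program scalardemo

  implicit   real*8 (a-h,o-z)

  integer, parameter :: n = 6

  do i = 1, n
    !
    ! b6(:) = 0.0d0 ; b6(i) = 1.0d0, written out with scalars
    !
    b6_1 = 0.0d0; b6_2 = 0.0d0; b6_3 = 0.0d0
    b6_4 = 0.0d0; b6_5 = 0.0d0; b6_6 = 0.0d0
    select case (i)
      case (1); b6_1 = 1.0d0
      case (2); b6_2 = 1.0d0
      case (3); b6_3 = 1.0d0
      case (4); b6_4 = 1.0d0
      case (5); b6_5 = 1.0d0
      case (6); b6_6 = 1.0d0
    end select
    !
    ! an indexed read like b6(ii) also turns into a select case
    !
    do ii = 1, n
      select case (ii)
        case (1); bii = b6_1
        case (2); bii = b6_2
        case (3); bii = b6_3
        case (4); bii = b6_4
        case (5); bii = b6_5
        case (6); bii = b6_6
      end select
      write(*,*) i, ii, bii
    enddo
  enddo

end

The same rewrite applies to x6, though with 36 entries the select blocks become correspondingly longer.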

Above all, it seems to me that, if my implementation is not at fault, OpenACC Fortran is for some reason unable to handle private arrays in the accelerated region once the iteration count is very large (even though, on the surface, the estimated memory requirement of my problem does not seem large at all).

I don’t have an answer for you on this one yet. I reported it to engineering as TPR#19484 and will post once I know more.

  • Mat