call to cuMemHostUnregister returned error 700: Launch failed

OpenACC Fortran

(Please do not suggest using CUDA Fortran. I am working on a huge legacy code, and OpenACC is the simplest way to port it to the GPU.)

This is a follow-up to the previous problem. I am trying to do a series of matrix inversions in a loop. The issue is that the number of iterations “niter” can only be quite small. If it is set to a value such as 1000, the following message is returned when the program is executed:

call to cuMemHostUnregister returned error 700: Launch failed

You should not try to change

!$acc parallel
!$acc loop private(x6,b6)

to

!$acc parallel private(x6,b6)
!$acc loop

Otherwise, the computed results will be incorrect because of a race condition: with “private” on the parallel construct, each gang gets only one copy of x6 and b6, which is then shared by all the loop iterations running within that gang.

The source code is posted below

program matrixinverse

  implicit   real*8 (a-h,o-z)

  real*8  amatr(6,6,10000), x6(6,6), b6(6)

  niter = 10000
  n = 6

  amatr = 0.0d0
  do ie = 1, niter
    do j = 1, n
      amatr(j,j,ie) = dble(j)
    enddo
  enddo

!$acc parallel
!$acc loop private(x6,b6)
  do ie = 1, niter
  !
  ! LU decomposition by point 
  !
   do k = 1, n
     !
     ! decompose the diagonal terms
     !
     do m = 1, k-1
       !
       amatr(k,k,ie) = amatr(k,k,ie) - amatr(k,m,ie)*amatr(m,k,ie)
       !
     enddo
     !
     adicv = 1.0d0/amatr(k,k,ie)
     !
     ! decompose the non diagonal terms
     !
     do i = k+1, n
       !
       do m = 1, k-1
         !
         amatr(k,i,ie) = amatr(k,i,ie) - amatr(k,m,ie)*amatr(m,i,ie)
         amatr(i,k,ie) = amatr(i,k,ie) - amatr(i,m,ie)*amatr(m,k,ie)
         !
       enddo
       !
       amatr(i,k,ie) = amatr(i,k,ie)*adicv
       !
     enddo
     !
   enddo

   do i = 1, n
     !
     do j = 1, n
       !
       b6(j) = 0.0d0
       !
     enddo
     !
     b6(i) = 1.0d0
     !
     ! LU resolution
     !
     do ii = 1, n
       !
       c = 0.0d0
       !
       do jj = 1, ii-1
         !
         c = c + amatr(ii,jj,ie)*x6(jj,i)
         !
       enddo
       !
       x6(ii,i) = b6(ii) - c
       !
     enddo
     !
     do ii = n, 1, -1
       !
       c = 0.0d0
       !
       do jj = ii+1, n
         !
         c = c + amatr(ii,jj,ie)*x6(jj,i)
         !
       enddo
       !
       x6(ii,i) = (x6(ii,i) - c)/amatr(ii,ii,ie)
       !
     enddo
     !
   enddo
   !
   ! this is to test if the private array returns correct values
   !
   amatr(:,:,ie) = x6(:,:)
   !
  enddo
!$acc end parallel

  do ie = 1, niter
    write(*,*) ie
    do i = 1, n
      write(*,"(6f15.5)") (amatr(i,j,ie), j=1,n)
    enddo
  enddo

end

If “niter” is small enough, the result is printed to the screen for each iteration. Every iteration should produce the same result: the diagonal matrix with entries 1, 1/2, 1/3, 1/4, 1/5, 1/6, as shown below

“No. of iteration”
1.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.50000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.33333 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.25000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.20000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.16667

Hi e3lb89cz,

Most likely this is the same issue that you reported earlier. One thing you can do is manually privatize x6 and b6 by adding an extra dimension. Then put them in a “create” clause so they are allocated on the device but never copied between the host and the GPU. Here’s an example:

program matrixinverse

  implicit   real*8 (a-h,o-z)

  real*8  amatr(6,6,10000), x6(6,6,10000), b6(6,10000)

  niter = 10000
  n = 6

!$acc data create(x6,b6) copyout(amatr)
!$acc kernels
  amatr = 0.0d0
  do ie = 1, niter
    do j = 1, n
      amatr(j,j,ie) = dble(j)
    enddo
  enddo
!$acc end kernels

!$acc parallel  
!$acc loop
  do ie = 1, niter
  !
  ! LU decomposition by point
  !
   do k = 1, n
     !
     ! decompose the diagonal terms
     !
     do m = 1, k-1
       !
       amatr(k,k,ie) = amatr(k,k,ie) - amatr(k,m,ie)*amatr(m,k,ie)
       !
     enddo
     !
     adicv = 1.0d0/amatr(k,k,ie)
     !
     ! decompose the non diagonal terms
     !
     do i = k+1, n
       !
       do m = 1, k-1
         !
         amatr(k,i,ie) = amatr(k,i,ie) - amatr(k,m,ie)*amatr(m,i,ie)
         amatr(i,k,ie) = amatr(i,k,ie) - amatr(i,m,ie)*amatr(m,k,ie)
         !
       enddo
       !
       amatr(i,k,ie) = amatr(i,k,ie)*adicv
       !
     enddo
     !
   enddo

   do i = 1, n
     !
     do j = 1, n
       !
       b6(j,ie) = 0.0d0
       !
     enddo
     !
     b6(i,ie) = 1.0d0
     !
     ! LU resolution
     !
     do ii = 1, n
       !
       c = 0.0d0
       !
       do jj = 1, ii-1
         !
         c = c + amatr(ii,jj,ie)*x6(jj,i,ie)
         !
       enddo
       !
       x6(ii,i,ie) = b6(ii,ie) - c
       !
     enddo
     !
     do ii = n, 1, -1
       !
       c = 0.0d0
       !
       do jj = ii+1, n
         !
         c = c + amatr(ii,jj,ie)*x6(jj,i,ie)
         !
       enddo
       !
       x6(ii,i,ie) = (x6(ii,i,ie) - c)/amatr(ii,ii,ie)
       !
     enddo
     !
   enddo
   !
   ! this is to test if the private array returns correct values
   !
   !
  enddo
!$acc end parallel
!$acc kernels 
   amatr(:,:,:) = x6(:,:,:)
!$acc end kernels
!$acc end data 

  do ie = 1, niter
    write(*,*) ie
    do i = 1, n
      write(*,"(6f15.5)") (amatr(i,j,ie), j=1,n)
    enddo
  enddo

end

Hi Mat,

Thanks very much for your prompt feedback.

As you suggested, adding an extra dimension is indeed one solution to my problem, and one I had initially considered. However, in the real application of my code, the total iteration count is expected to go well beyond the million scale (I am working on Computational Fluid Dynamics), so I think the memory overhead of this strategy would be rather prohibitive. That is why I raised this problem.
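
For reference, here is a rough back-of-the-envelope sketch (my own estimate, assuming 8-byte reals and one x6(6,6) plus one b6(6) slice per iteration) of how the extra storage from the manual privatization scales:

program memestimate

  implicit none

  integer(8), parameter :: words_per_iter = 6*6 + 6   ! one x6(6,6) plus one b6(6) slice
  integer(8), parameter :: sizes(3) = (/ 10000_8, 1000000_8, 10000000_8 /)
  integer :: k

  ! extra bytes = 42 words * 8 bytes * niter
  do k = 1, 3
    write(*,'(a,i9,a,f8.1,a)') ' niter = ', sizes(k), '  adds about ', &
                               dble(words_per_iter*sizes(k)*8_8)/1.0d6, ' MB'
  enddo

end

At 10000 iterations this is only a few megabytes, but at tens of millions of iterations it grows into the gigabyte range, and that is just for these two work arrays.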

Above all, it seems to me that, if my implementation is not at fault, OpenACC Fortran is for some reason unable to handle private arrays in the accelerated region once the iteration count is very large (even though, on the surface, the estimated memory requirement of my problem does not seem large at all).

Please let me know if anything wrong with my understanding. Thanks!

As you suggested, adding an extra dimension is indeed one solution to my problem, and one I had initially considered. However, in the real application of my code, the total iteration count is expected to go well beyond the million scale (I am working on Computational Fluid Dynamics), so I think the memory overhead of this strategy would be rather prohibitive. That is why I raised this problem.

When you use the “private” clause, the entire private array is allocated for each thread (or each gang, depending on where the private clause is applied) before the kernel is launched, so the total memory use can end up the same as manually privatizing the array. Hence, when you go larger, “private” won’t help with memory size unless you use a fixed number of gangs and a fixed vector length, which in turn will limit your parallelization.
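
If you do want to go that route, here’s a rough sketch (my own illustration, not tuned for your code) of what fixing the launch configuration could look like. The values num_gangs(256) and vector_length(1) are arbitrary choices; with “private” on the gang loop they cap the number of x6/b6 copies at 256:

program fixedgangs

  implicit   real*8 (a-h,o-z)

  integer, parameter :: n = 6, niter = 10000
  real*8  amatr(n,n,niter), x6(n,n), b6(n)

  amatr = 0.0d0

!$acc parallel copy(amatr) num_gangs(256) vector_length(1)
!$acc loop gang private(x6,b6)
  do ie = 1, niter
    ! stand-in for the real per-iteration work on the private x6 and b6
    b6(1)   = dble(ie)
    x6(1,1) = b6(1)
    amatr(1,1,ie) = x6(1,1)
  enddo
!$acc end parallel

  write(*,*) amatr(1,1,1), amatr(1,1,niter)

end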

Another solution, especially if n remains small, is to change your arrays to scalars. It’s a bit more coding work, but it allows the use of registers and lowers the total memory used, since the variables have the same lifetime as the thread.
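
For your n = 6 case it could look something like this toy sketch (only the b6 part is shown, and the names b6_1 through b6_6 are just illustrative). Indexed accesses such as b6(ii) become a SELECT CASE:

program scalardemo

  implicit   real*8 (a-h,o-z)

  integer, parameter :: n = 6

  do i = 1, n
    !
    ! b6(:) = 0.0d0 ; b6(i) = 1.0d0, written out with scalars
    !
    b6_1 = 0.0d0; b6_2 = 0.0d0; b6_3 = 0.0d0
    b6_4 = 0.0d0; b6_5 = 0.0d0; b6_6 = 0.0d0
    select case (i)
      case (1); b6_1 = 1.0d0
      case (2); b6_2 = 1.0d0
      case (3); b6_3 = 1.0d0
      case (4); b6_4 = 1.0d0
      case (5); b6_5 = 1.0d0
      case (6); b6_6 = 1.0d0
    end select
    !
    ! an indexed read like b6(ii) also turns into a select case
    !
    do ii = 1, n
      select case (ii)
        case (1); bii = b6_1
        case (2); bii = b6_2
        case (3); bii = b6_3
        case (4); bii = b6_4
        case (5); bii = b6_5
        case (6); bii = b6_6
      end select
      write(*,*) i, ii, bii
    enddo
  enddo

end

The same rewrite applies to x6, though with 36 entries the select blocks become correspondingly longer.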

Above all, it seems to me that, if my implementation is not at fault, OpenACC Fortran is for some reason unable to handle private arrays in the accelerated region once the iteration count is very large (even though, on the surface, the estimated memory requirement of my problem does not seem large at all).

I don’t have an answer for you on this one yet. I reported it to engineering as TPR#19484 and will post once I know more.

  • Mat