call to cuMemFree returned error 700: Launch failed

I have distilled my code down to a test case which basically does nothing but pass values. Please take a look and try to compile and run it. You may get this error message:

call to cuMemFree returned error 700: Launch failed

There should be no out-of-bounds memory access, since I checked the host version. If you happen to know why this error occurs, please let me know. I am keen to overcome this issue. Thanks a lot in advance!

program inversematrix 

  implicit   real*8 (a-h,o-z) 

  real*8  a(6,6,10000) 
  real*8  c(6,6), L(6,6), U(6,6), b(6), d(6), x(6) 

  niter = 10000 
  n = 6 

  a = 0.0d0 
  do ie = 1, niter 
  do i = 1, n 
  a(i,i,ie) = 1.0d0 
  enddo 
  enddo 

!$acc data region 
!$acc region 
!$acc loop kernel independent private(c,L,U,b,d,x) 
  do ie = 1, niter 

  c(:,:)=a(:,:,ie) 
  L=c 
  U=L 
  b(:) = U(:,1) 
  d=b 
  x=d 
  a(:,:,ie)=L(:,:) 

  enddo 
!$acc end region  
!$acc end data region 

end program inversematrix

Please use:

pgf90 inverse.f90 -o run -r8 -O2 -g -traceback -pg -acc -ta=nvidia,flushz,time,cc20,keepgpu,keepptx -Mcuda=ptxinfo -Minfo=accel

Hi e3lb89cz,

I’m still not convinced that it’s just a memory limit issue, but I don’t see the exact cause myself, so I will pass this on to engineering (logged as TPR#19484).

As a workaround, you can limit the privatization to each gang by putting the private clause on the parallel construct, though you’ll be limiting the amount of parallelization.

% cat test2.f90
program inversematrix

  implicit   real*8 (a-h,o-z)

  real*8  a(6,6,10000)
  real*8 ::  c(6,6), L(6,6), U(6,6), b(6), d(6), x(6)

  niter = 10000
  n = 6

  a = 0.0d0
  do ie = 1, niter
  do i = 1, n
  a(i,i,ie) = 1.0d0
  enddo
  enddo

!$acc parallel private(c,L,U,b,d,x)
!$acc loop
  do ie = 1, niter

  c(:,:)=a(:,:,ie)
  L=c
  U=L
  b(:) = U(:,1)
  d=b
  x=d
  a(:,:,ie)=L(:,:)

  enddo
!$acc end parallel

  print *, a(1,1,1)
end program inversematrix
% pgf90 -acc -Minfo test2.f90 -ta=nvidia ; a.out
inversematrix:
     11, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     18, Accelerator kernel generated
         20, !$acc loop gang ! blockidx%x
         22, !$acc loop vector(256) ! threadidx%x
         25, !$acc loop vector(256) ! threadidx%x
         28, !$acc loop vector(256) ! threadidx%x
     18, Generating present_or_copy(a(:,:,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     22, Loop is parallelizable
     25, Loop is parallelizable
     28, Loop is parallelizable
    1.000000000000000
- Mat

Hi Mat,

Thank you for your feedback. What you suggest does avoid the memory issue mentioned above. However, the real problem is that if I do serious linear algebra in the loop using those private arrays, the final results are incorrect. I tried moving the private clause from the parallel construct to the loop construct, and that gives me the correct results. But again, with the clause on the loop, the number of iterations can only be set to a small number. I will post a new topic on this issue very soon, and you can take a look at it then.
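For reference, the loop-level placement described above might look roughly like the sketch below. It is only an illustration under assumptions: the program name, the reduced niter of 100, the scratch arrays c and b, and the factor of 2 are hypothetical stand-ins and not the real linear algebra code. How the compiler actually allocates and schedules the private copies may differ between versions.

```fortran
program private_on_loop
  implicit none
  ! Assumption: niter is deliberately kept small for this sketch.
  integer, parameter :: n = 6, niter = 100
  real(8) :: a(n,n,niter)
  real(8) :: c(n,n), b(n)   ! hypothetical scratch arrays
  integer :: i, ie

  ! Build niter identity matrices, as in the test case above.
  a = 0.0d0
  do ie = 1, niter
    do i = 1, n
      a(i,i,ie) = 1.0d0
    enddo
  enddo

  ! The private clause sits on the loop construct rather than on
  ! the parallel construct, so c and b are privatized for the
  ! threads executing this loop instead of once per gang.
  !$acc parallel
  !$acc loop private(c,b)
  do ie = 1, niter
    c(:,:) = a(:,:,ie)          ! copy the matrix into scratch
    b(:) = c(:,1)               ! take its first column
    a(:,:,ie) = 2.0d0*c(:,:)    ! placeholder for the real work
  enddo
  !$acc end parallel

  print *, a(1,1,1)
end program private_on_loop
```

Without -acc the directives are treated as comments, so the same source can also be built as a host-only reference to compare results against.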