cuCtxSynchronize error 700

I am trying to accelerate a single subroutine. If I put a compute region around the full content of the routine I get an error:

“call to cuCtxSynchronize returned error 700: Launch failed”

However, if I split the routine up into two compute regions everything executes correctly (extra unwanted overhead, but it works). It's not clear to me why this happens.

Looking at the code, it's possible that there are bank conflicts (it is a 2D FD stencil), but the kernel should still launch. More detail on the error or similar experience would be appreciated.

A quick search brought up this forum post:

Could it be that the PGI compiler is making the same cuParamSeti mistake? I AM on an x86_64 system with 4 Tesla 1060s…

Hi bollig,

This is a generic error, usually meaning that there was a memory error (such as a segv) when transferring memory over to the GPU. Most likely the two loops share some common arrays where the compiler is getting confused about the bounds. I found a similar issue in one of the codes I was working on and reported it to our engineers. I was able to work around the problem by using the “copy”, “copyin” and “copyout” clauses to explicitly set the array bounds. Use the “-Minfo=accel” messages produced during compilation to see what bounds the compiler is using.

If you can, please send a report to PGI customer service and include the source (plus any data files and build instructions if needed).
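
As a minimal sketch of that workaround (the routine smooth and its arguments a, b, n, m are hypothetical and not taken from any code in this thread), the data clauses can spell out both bounds of every dimension, including a lower bound of 0:

```fortran
c-----Sketch only: smooth, a, b, n and m are hypothetical names
      subroutine smooth(a, b, n, m)
      integer n, m, i, j
      real a(n,0:m+1), b(n,m)

c-----Give the full bounds of each array in the data clauses
!$acc region
!$acc& copyin(a(1:n,0:m+1))
!$acc& copyout(b(1:n,1:m))
      do j = 1, m
         do i = 1, n
            b(i,j) = 0.5*(a(i,j-1) + a(i,j+1))
         end do
      end do
!$acc end region

      end

c-----Small driver so the sketch can be built and run on its own
      program demo
      integer n, m
      parameter (n=4, m=3)
      real a(n,0:m+1), b(n,m)
      a = 1.0
      call smooth(a, b, n, m)
      print *, b(1,1)
      end
```

Building this with something like pgfortran -ta=nvidia -Minfo=accel (the -ta=nvidia target is an assumption about your setup) should print the bounds the compiler actually generated, which you can then compare against the clauses. Without an accelerator target the directives are treated as comments and the loop runs on the host.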



I have a new case of this, so I’m bumping this thread (the title is about what I’d use). If I don’t use local, copyin, and copyout, my code errors out with a cuMemcpy2D error. But adding them leads to a cuCtxSynchronize error. In case it’s something simple (like a dimension I’m missing), I’m reproducing my variables and !$acc region lines:

c-----input parameters

      integer m,np,ict,icb,ih1,ih2,im1,im2,is1,is2
      real rr(m,0:np+1,2),tt(m,0:np+1,2),td(m,0:np+1,2)
      real rs(m,0:np+1,2),ts(m,0:np+1,2)
      real cc(m,3)

c-----temporary array

      integer i,k,ih,im,is
      real rra(m,0:np+1,2,2),tta(m,0:np+1,2,2),tda(m,0:np+1,2,2)
      real rsa(m,0:np+1,2,2),rxa(m,0:np+1,2,2)
      real ch(m),cm(m),ct(m),flxdn(m,0:np+1)
      real fdndir(m),fdndif(m),fupdif
      real denm,xx,yy

c-----output parameters

      real fclr(m,np+1),fall(m,np+1)
      real fsdir(m),fsdif(m)

!$acc region
!$acc& copyin(rr(1:m,0:np+1,1:2),
!$acc& tt(1:m,0:np+1,1:2),
!$acc& td(1:m,0:np+1,1:2),
!$acc& rs(1:m,0:np+1,1:2),
!$acc& ts(1:m,0:np+1,1:2),
!$acc& cc(1:m,1:3))
!$acc& copyout(fclr(1:m,1:np+1),
!$acc& fall(1:m,1:np+1),
!$acc& fsdir(1:m),
!$acc& fsdif(1:m))
!$acc& local(rra(1:m,0:np+1,1:2,1:2),
!$acc& tta(1:m,0:np+1,1:2,1:2),
!$acc& tda(1:m,0:np+1,1:2,1:2),
!$acc& rsa(1:m,0:np+1,1:2,1:2),
!$acc& rxa(1:m,0:np+1,1:2,1:2),
!$acc& ch(1:m),
!$acc& cm(1:m),
!$acc& ct(1:m),
!$acc& flxdn(1:m,0:np+1),
!$acc& fdndir(1:m),
!$acc& fdndif(1:m))

As near as I can tell, I have the array dimensions correct. Upon compiling, the -Minfo=accel outputs:

   2316, Generating copyin(td(:m,:np+1,:))
         Generating copyin(tt(:m,:np+1,:))
         Generating copyin(rs(:m,:np+1,:))
         Generating copyin(rr(:m,:np+1,:))
         Generating copyin(ts(:m,:np+1,:))
         Generating local(ch(:m))
         Generating copyin(cc(:m,:))
         Generating local(cm(:m))
         Generating copyout(fsdir(:m))
         Generating copyout(fsdif(:m))
         Generating local(flxdn(:m,:np+1))
         Generating local(fdndif(:m))
         Generating local(fdndir(:m))
         Generating local(tta(:m,:np+1,:,:))
         Generating local(tda(:m,:np+1,:,:))
         Generating local(rsa(:m,:np+1,:,:))
         Generating local(ct(:m))
         Generating local(rxa(:m,:np+1,:,:))
         Generating local(rra(:m,:np+1,:,:))
         Generating copyout(fclr(:m,:np+1))
         Generating copyout(fall(:m,:np+1))

The compiler seems to have suppressed some dimensions to “:”. Could that do it? Or should I use ct(m) rather than ct(1:m), say?

ETA: I’m suspecting a bug after some trial and error. I’ll submit to Customer Service.