cuMemcpyDtoH error

Hi,

I’m trying to use the accelerator directives on a simple loop, but the compiler gives this error:

call to cuMemcpyDtoH returned error 700: Launch failed

The code that I’m using with the !$acc directive is

!$acc region copyin(r_c(1:resids,1:3,1:2),r_cb(1:resids,1:3,1:2)),&
!$acc copyin(epsij(1:20),dat1(1:resids,1)),&
!$acc copyout(Ener)
do j=1,resids
Vbb=0.0D0
Vhp=0.0D0
mol1=dat1(i,1)
mol2=dat1(j,1)

if (mol1.ne.10) then
dx=(r_cb(i,1,1)-r_c(j,1,2))
dy=(r_cb(i,2,1)-r_c(j,2,2))
dz=(r_cb(i,3,1)-r_c(j,3,2))
r=(dx**2+dy**2+dz**2)**0.50D0
sigma=(s_c+s_cb)/2.0D0
rc=sigma*2.0D0**(1.0D0/6.0D0)
if (r.le.rc) then
rr=(sigma/r)**6
Vbb=4.0D0*epsbb*&
(rr**2-rr+0.250D0)
end if
end if

if ((mol1.ne.10).and.(mol2.ne.10)) then
dx=r_cb(i,1,1)-r_cb(j,1,2)
dy=r_cb(i,2,1)-r_cb(j,2,2)
dz=r_cb(i,3,1)-r_cb(j,3,2)
r=(dx**2+dy**2+dz**2)**0.5000
rr=(s_cb/r)**6
rc=s_cb*2.0**(1.0/6.0)
eij=(epsij(mol1)*epsij(mol2))**0.50D0

if (r.le.rc) then
Vhp=4.0D0*epshp*(rr**2-rr)+&
epshp*(1.0D0-eij)
else
Vhp=4.0D0*epshp*eij*(rr**2-rr)
end if

end if

Ener(j)=Vhp+Vbb

end do
!$acc end region


I don’t understand why it gives me that error, since the arrays are small (the variable resids is not bigger than 10), and my GPU has 1.5GB of memory. Could you help me with this problem?

Thanks,
Marco

Hi Marco,

Would it be possible for you to send me an example code which exhibits this behavior? If so, please send a report to PGI Customer Service (trs@pgroup.com) and ask them to send it to me.

The error “cuMemcpyDtoH” means there was a failure in copying from the device to the host. It could actually be an error in the copy (for example, if Ener’s size is smaller than resids), but it could also mean the kernel itself has an error. Most likely it’s a problem with the compiler, but I’ll need a full example to tell.
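
To rule out the size mismatch mentioned above, a host-side check along these lines could be run before entering the region. This is only a sketch: the array sizes, the integer type of dat1, and the placeholder values are assumptions made to keep it runnable, not details taken from the posted code.

```fortran
program check_region_inputs
  implicit none
  ! Hypothetical sizes and contents, chosen only to make the sketch runnable.
  integer, parameter :: resids = 10
  integer :: dat1(resids,1)            ! assumed to hold integer type codes
  double precision :: epsij(20), Ener(resids)
  integer :: j
  logical :: ok

  dat1 = 5          ! placeholder values
  epsij = 1.0D0
  ok = .true.

  ! copyout(Ener) must be able to hold every Ener(j) written by the loop.
  if (size(Ener) < resids) then
     print *, 'Ener has', size(Ener), 'elements but the loop writes', resids
     ok = .false.
  end if

  ! dat1(:,1) is used as an index into epsij(1:20) inside the region,
  ! so every value must fall within the bounds of epsij.
  do j = 1, resids
     if (dat1(j,1) < 1 .or. dat1(j,1) > size(epsij)) then
        print *, 'dat1(', j, ',1) =', dat1(j,1), 'is outside the bounds of epsij'
        ok = .false.
     end if
  end do

  if (ok) print *, 'region inputs look consistent'
end program check_region_inputs
```

If a check like this passes and the failure persists, the kernel or the compiler becomes the more likely suspect.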

Thanks,
Mat

Hi Mat,
I’ve been checking the code, but I haven’t found any mistake in terms of vector sizes or undeclared variables. I sent the sample code the same day that you asked me. Have you checked it? Is it a problem with the compiler?

Thanks,
Marco

Hi Marco,

I went through all of the TRS mail back to the 10th, but don’t see any messages from you. It’s possible that it got stopped by the corporate spam filter or the attachment was too big. I’ll send you an email directly.

- Mat

Hi Marco,

Thank you for the code. This does appear to be a compiler error where a bad value is being used when initializing the cached copies of a variable. I have submitted this problem to our engineers as TPR#16500.

The error does appear to have been found and fixed in our internal development compiler, and I have requested that this fix be added to our next release (10.2), due at the beginning of February. The workaround for this issue is to use the flag “-ta=nvidia,oldcg”.

Best Regards,
Mat

Thanks Mat. With the “-ta=nvidia,oldcg” flag the code runs correctly. However, I still have a problem. That piece of code is just a subroutine in a main code. When I copy that subroutine, with the same accelerator directives, into the main code, it gives me another error:

“call ctxSynchronize returned error 700: Launch failed”

What does that error mean?

Thanks,
Marco

Hi Marco,

It’s a generic error, so it could be caused by a number of things. Typically, though, I’ve seen it when there was a seg fault copying the data to the device or a seg fault in the kernel.

- Mat

Mat,
How could that happen if the segment runs perfectly with the acc directive when it is isolated, and it is the only segment in the code that uses an accelerator region?

Could you give me any hint to solve that issue?

Thanks,
Marco

Array bounds violation? Feel free to send me the full source if you’re able.

Thanks Mat. I’d appreciate it if you could help me with that, since I don’t understand why it gives me that error when the subroutine runs perfectly after I copy it to a different project. I’ll send both codes (the full code which gives me the error, and the code with just the subroutine) to the same email.

Thanks again,
Marco

Hi Marco,

I have had an error with similar behavior. I managed to work around it by declaring the inner (non-parallel) loops of my kernel as sequential using !$acc do seq.
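
For illustration, here is a minimal sketch of that idea using the PGI Accelerator directives. The arrays a and b, the sizes n and m, and the loop body are hypothetical and unrelated to Marco’s subroutine; only the placement of the directive is the point.

```fortran
program seq_inner_loop
  implicit none
  ! Hypothetical sizes and data, chosen only to make the sketch runnable.
  integer, parameter :: n = 8, m = 4
  double precision :: a(n,m), b(n), s
  integer :: i, k

  a = 1.0D0

!$acc region copyin(a) copyout(b)
  do i = 1, n
     s = 0.0D0
     ! Mark the inner loop as sequential so that only the outer loop
     ! is considered for parallel execution on the device.
!$acc do seq
     do k = 1, m
        s = s + a(i,k)
     end do
     b(i) = s
  end do
!$acc end region

  print *, b
end program seq_inner_loop
```

A compiler without accelerator support treats the !$acc lines as comments, so the same source can be run on the host to compare results.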

Good luck!

Karl

May I ask what “oldcg” is for?

Tuan

Hi Tuan,

“oldcg” is being used as a workaround for a bug in the “newcg”. “cg” stands for code generator. As of 10.0, we added a new code generator targeting the NVIDIA GPU. Unfortunately, like many new features, it has bugs. In this case, the bug did not occur in the old code generator from the 9.0 release. Note that the “oldcg” flag is not documented and will eventually go away.

- Mat