Error when running optimized code, but it runs fine with debug flags (different from my previous post)

Hello,

I am trying to run my multi-GPU code, which is built with mpif90 and OpenACC. I am using the following compilation flags:

 -acc -fast -ta=multicore -ta=tesla:managed -Minfo=accel

and I run the code with:

mpirun --allow-run-as-root -np $(np) ./bin/dew

When I do this, I get the following error:

Failing in Thread:1
call to cuMemcpyDtoHAsync returned error 719: Launch failed (often invalid pointer dereference)

FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[10834,1],0]
  Exit code:    1
--------------------------------------------------------------------------
make: *** [run] Error 1

I tried to debug using the following flags:

-g -C

Initially, this exposed a few errors, which I have since fixed. Now, however, the code runs fine when built with the debug flags. If I switch back to the optimized flags (the first set I listed), the code fails with the error I pasted above. I also tried compute-sanitizer, but that did not really help either.

Any advice would be appreciated. Thank you.