I am posting for the first time, so I apologize in advance for any mistakes I make.

Below is a condensed version of some code that I am trying to accelerate using OpenACC.

```
subroutine get_common(mesh, beta)
  type(triangle_mesh), intent(inout) :: mesh
  real(c_double), optional, intent(in ) :: beta
  ! locals
  integer(c_int) :: j, k, glb_face_index, left_face_index, rght_face_index, left, rght
  integer(c_int) :: elem2face_left(mesh%elem_num_faces), elem2face_rght(mesh%elem_num_faces)
  integer(c_int) :: left_indices(2), rght_indices(2)
  integer(c_int) :: face2elem(2)
  real(c_double) :: ibeta
  real(c_double) :: uflux_left(mesh%P1, mesh%Nvar), uflux_rght(mesh%P1, mesh%Nvar)
  real(c_double) :: uflux_tmp(mesh%P1, mesh%Nvar), uflux_avg(mesh%P1, mesh%Nvar)
  integer(c_int) :: Nfaces, elem_num_faces, P1, Nvar
  real(c_int) :: devFace2elem(mesh%Nfaces, 2), devElem2face(mesh%Nelements, mesh%elem_num_faces)
  real(c_double) :: devUflux(mesh%Nflux, mesh%Nvar, mesh%Nelements), devUcommon(mesh%Nflux, mesh%Nvar, mesh%Nelements)
  real(c_double) :: devtemp(2)

  Nfaces = mesh%Nfaces
  Nvar = mesh%Nvar
  P1 = mesh%P1
  elem_num_faces = mesh%elem_num_faces
  devFace2elem = mesh%face2elem
  devElem2face = mesh%elem2face
  devUflux = mesh%uflux
  devUcommon = mesh%ucommon
  uflux_left = 0
  write(*,*) 'CPU', devUflux(2,2,2)
  ibeta = HALF; if (present(beta)) ibeta = beta

  !$acc parallel loop copyin(devUflux(1:mesh%Nflux, 1:mesh%Nvar, 1:mesh%Nelements))
  do glb_face_index = 1, Nfaces
    face2elem = devFace2elem(glb_face_index,:)
    if ( .not. (any(face2elem .eq. 0)) ) then
      left = face2elem(1); rght = face2elem(2)
      elem2face_left(:) = devElem2face(left, :)
      elem2face_rght(:) = devElem2face(rght, :)
      do j = 1, elem_num_faces
        if( elem2face_left(j) .eq. glb_face_index) left_face_index = j
        if( elem2face_rght(j) .eq. glb_face_index) rght_face_index = j
      end do
      left_indices(1) = P1*(left_face_index-1)+1
      left_indices(2) = P1*left_face_index
      rght_indices(1) = P1*(rght_face_index-1)+1
      rght_indices(2) = P1*rght_face_index
      write(*,*) devUflux(2,2,2)
    endif
  end do
  !$acc end parallel loop
end subroutine get_common
```

I am sorry for posting a condensed version, but I don’t know whether I can share the whole thing.

First, some background on the odd behaviour I am seeing. I have tested the whole code with several compilers and libraries (gfortran, ifort, BLAS, MKL, OpenMP), and it works fine with all of them. It also works fine when compiled with PGI alone. But as soon as I enable OpenMP with PGI, it gives different results.

Now to the code above. Again, it works fine if I don’t use the accelerator. With the accelerator enabled, however, the two prints of devUflux(2,2,2), one before the parallel loop and one inside it, give different results.
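
To narrow this down, a minimal standalone sketch of the same loop pattern may be useful. It is not the real solver: the program name, array names and sizes (`private_scratch_sketch`, `face2elem_all`, `pair`, `nfaces`) are all made up for illustration. It assumes, as far as I understand OpenACC, that scalars assigned inside a `parallel loop` are private by default while small scratch arrays such as `face2elem` are shared unless they are listed in a `private` clause, so the sketch lists the scratch array explicitly.

```fortran
! Minimal standalone sketch with hypothetical names (not the real solver).
! With OpenACC: e.g. pgfortran -acc -Minfo=accel sketch.f90
! Without OpenACC the !$acc lines are plain comments and the loop runs serially.
program private_scratch_sketch
  use iso_c_binding, only: c_int
  implicit none
  integer(c_int), parameter :: nfaces = 8
  integer(c_int) :: face2elem_all(nfaces, 2)  ! stand-in for devFace2elem
  integer(c_int) :: pair(2)                   ! per-iteration scratch, like face2elem
  integer(c_int) :: i, ninterior

  ! Face i sits between elements i and i+1; the last face is a boundary (0).
  do i = 1, nfaces
    face2elem_all(i, 1) = i
    face2elem_all(i, 2) = i + 1
  end do
  face2elem_all(nfaces, 2) = 0

  ninterior = 0
  ! private(pair) gives every iteration its own copy of the scratch array,
  ! so concurrent iterations do not overwrite each other's values.
  !$acc parallel loop copyin(face2elem_all) private(pair) reduction(+:ninterior)
  do i = 1, nfaces
    pair = face2elem_all(i, :)
    if (.not. any(pair == 0)) ninterior = ninterior + 1
  end do
  !$acc end parallel loop

  print *, 'interior faces counted:', ninterior
end program private_scratch_sketch
```

In this sketch only the last face touches element 0, so I would expect the count to be 7 both with and without the accelerator. If the count changes when `private(pair)` is removed, that would point at the shared scratch arrays in my real loop (`face2elem`, `elem2face_left`, `elem2face_rght`, `left_indices`, `rght_indices`).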

I also arrived at this problem after repeatedly getting errors such as the following:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I may come back to that error once the present problem is solved.

Please help.