OpenACC changes value of array

I am posting for the first time, so I apologize in advance for any mistakes I make.

I am posting a version of some code I am trying to accelerate using OpenACC.

subroutine get_common(mesh, beta)
    type(triangle_mesh), intent(inout) :: mesh
    real(c_double), optional,       intent(in   ) :: beta

    ! locals
    integer(c_int) :: j, k, glb_face_index, left_face_index, rght_face_index, left, rght
    integer(c_int) :: elem2face_left(mesh%elem_num_faces), elem2face_rght(mesh%elem_num_faces)
    integer(c_int) :: left_indices(2), rght_indices(2)
    integer(c_int) :: face2elem(2)

    real(c_double) :: ibeta
    real(c_double) :: uflux_left(mesh%P1, mesh%Nvar), uflux_rght(mesh%P1, mesh%Nvar)
    real(c_double) :: uflux_tmp(mesh%P1, mesh%Nvar), uflux_avg(mesh%P1, mesh%Nvar)


    integer(c_int) :: Nfaces, elem_num_faces, P1, Nvar
    integer(c_int) :: devFace2elem(mesh%Nfaces, 2), devElem2face(mesh%Nelements, mesh%elem_num_faces)
    real(c_double) :: devUflux(mesh%Nflux, mesh%Nvar, mesh%Nelements), devUcommon(mesh%Nflux, mesh%Nvar, mesh%Nelements)
    real(c_double) :: devtemp(2)

    Nfaces = mesh%Nfaces
    Nvar = mesh%Nvar
    P1 = mesh%P1
    elem_num_faces = mesh%elem_num_faces
    devFace2elem = mesh%face2elem
    devElem2face = mesh%elem2face
    devUflux = mesh%uflux
    devUcommon = mesh%ucommon

    uflux_left = 0
    write(*,*) 'CPU', devUflux(2,2,2)
    ibeta = HALF; if (present(beta)) ibeta = beta
    !$acc parallel loop copyin(devUflux(1:mesh%Nflux, 1:mesh%Nvar, 1:mesh%Nelements))
    do glb_face_index = 1, Nfaces
      face2elem = devFace2elem(glb_face_index,:)
      if ( .not. (any(face2elem .eq. 0)) ) then
        left = face2elem(1); rght = face2elem(2)
        elem2face_left(:) = devElem2face(left, :)
        elem2face_rght(:) = devElem2face(rght, :)
        do j = 1, elem_num_faces
           if( elem2face_left(j) .eq. glb_face_index) left_face_index = j
           if( elem2face_rght(j) .eq. glb_face_index) rght_face_index = j
        end do

        left_indices(1) = P1*(left_face_index-1)+1
        left_indices(2) = P1*left_face_index

        rght_indices(1) = P1*(rght_face_index-1)+1
        rght_indices(2) = P1*rght_face_index
        
        write(*,*) devUflux(2,2,2)
      endif
    end do
    !$acc end parallel loop 
end subroutine get_common

I am sorry for posting a condensed version, but I don’t know whether I can share the whole thing.

First, some background on the odd behavior I have seen. I have tested the whole code with various compilers and libraries, such as gfortran, ifort, BLAS, MKL, and OpenMP, and it works fine with all of them. If I use PGI it works fine as well, but as soon as I add OpenMP with PGI it gives different results.

Now, coming to the above code: again, it works fine if I don’t use the accelerator. But if I do, notice that I print devUflux(2,2,2) once outside the loop and once inside it, and the two prints give different results.

Also, I came to this problem after repeatedly getting errors like

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

which I may get to after solving the present problem.

Please help.

Hi vsingh96824,

I am sorry for posting a condensed version, but I don’t know whether I can share the whole thing.

No worries, but is this bit of code complete? I ask because there’s no assignment and you’re not printing anything computed in the loop.

Can you post the compiler feedback messages (-Minfo=accel)?

My guess is you’re going to see messages indicating the loop isn’t parallelizable due to dependencies. For both OpenMP and OpenACC, arrays are shared by default, so every iteration of the “glb_face_index” loop would be using the same elem2face_left/rght and left/rght_indices arrays. If parallelized, this would cause a race condition. To fix it, you need to add these arrays to a “private” clause so that each thread gets its own copy.
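For illustration, a sketch of what that looks like on the posted loop (array names taken from your routine; the exact private list depends on what each iteration writes):

!$acc parallel loop private(face2elem, elem2face_left, elem2face_rght, &
!$acc                       left_indices, rght_indices)
do glb_face_index = 1, Nfaces
  ! ... loop body unchanged; each thread now gets its own copy of these work arrays
end do
!$acc end parallel loop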

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Does mesh contain an allocatable array, and are you trying to access this array in a compute region? If so, be sure to copy mesh itself in a data region before copying in the member array.
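A sketch of that pattern, assuming mesh%uflux is the allocatable member being read in the compute region:

!$acc data copyin(mesh)        ! copy the derived type itself first
!$acc data copyin(mesh%uflux)  ! then the allocatable member
!$acc parallel loop
do glb_face_index = 1, Nfaces
  ! ... read mesh%uflux(...) here ...
end do
!$acc end data
!$acc end data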

Other causes are using host pointers on the device, using too much device allocated data (such as if you’re calling a routine with automatic arrays), or out-of-bounds array accesses.

  • Mat

Hello Mat.

Thanks for the reply. I have edited the post, to include the full subroutine. I hope that will allow a better understanding.


Here is the Minfo output.

get_common:
    221, Generating copyin(devuflux(:mesh%nflux,:mesh%nvar,:mesh%nelements))
         Accelerator kernel generated
         Generating Tesla code
        222, !$acc loop gang ! blockidx%x
        223, !$acc loop vector(128) ! threadidx%x
        226, !$acc loop vector(128) ! threadidx%x
        241, !$acc loop vector(128) ! threadidx%x
        243, !$acc loop vector(128) ! threadidx%x
    221, Generating copyin(devface2elem(:nfaces,:))
         Generating copyout(face2elem(:))
         Generating copy(rght_indices(:),left_indices(:))
         Generating copyin(develem2face(:,:))
         Generating copy(elem2face_left(:),elem2face_rght(:),uflux_tmp(:,:))
    223, Loop is parallelizable
    224, Scalar last value needed after loop for face2elem$r at line 224
    226, Loop is parallelizable
    228, Scalar last value needed after loop for left_face_index at line 233
         Scalar last value needed after loop for left_face_index at line 234
         Scalar last value needed after loop for rght_face_index at line 236
         Scalar last value needed after loop for rght_face_index at line 237
         Accelerator restriction: scalar variable live-out from loop: rght_face_index,left_face_index
    241, Loop is parallelizable
    243, Loop is parallelizable

For reference 221 is the line number of the acc directive.

The full code obviously does computations, but after getting the errors I commented out more and more lines until I figured out which line started giving errors. Then I put in the write statements to compare devUflux(2,2,2) inside and outside the loop.
As I said, they give different outputs.

Regarding the OpenMP thing, as I said, it is only with pgfortran that I get a different output. With gfortran or ifort, I get correct output.

I will try to do as you advised. But please do see if the above information gives you more ideas.

Generating copyout(face2elem(:))
Generating copy(rght_indices(:),left_indices(:))
Generating copyin(develem2face(:,:))
Generating copy(elem2face_left(:),elem2face_rght(:),uflux_tmp(:,:))

These messages indicate that these arrays are shared amongst all threads. This causes a race condition. To fix, you’ll need to put these into a private clause on the parallel directive.

I’d also recommend adding “gang vector” to the parallel directive. Since the “j” loop isn’t parallelizable, it’s probably better that the compiler not parallelize the implicit loops from array syntax.
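Putting the two suggestions together, the directive would look something like this (a sketch; the private list needs to cover every array that an iteration writes, per the feedback messages above):

!$acc parallel loop gang vector                                &
!$acc   private(face2elem, elem2face_left, elem2face_rght,     &
!$acc           left_indices, rght_indices, uflux_tmp)         &
!$acc   copyin(devUflux, devFace2elem, devElem2face)
do glb_face_index = 1, Nfaces
  ! ... loop body unchanged ...
end do
!$acc end parallel loop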

Regarding the OpenMP thing, as I said, it is only with pgfortran that I get a different output. With gfortran or ifort, I get correct output.

Do you really mean OpenACC with PGI? Or are you getting wrong answers from a PGI-compiled OpenMP code that works with the other compilers?

  • Mat

Thanks so much for the response Mat. Will try out the methods you mentioned.

Do you really mean OpenACC with PGI? Or are you getting wrong answers from a PGI-compiled OpenMP code that works with the other compilers?

The whole code is parallelized with OpenMP. We tested with ifort and gfortran and they agree on the results. I tried the same thing with PGI and it gave different results. And it is only with OpenMP that it gives different results; without OpenMP, the PGI compiler gives the same results as ifort and gfortran.

I am trying to compare OpenMP with OpenACC, and this subroutine is the first one I have tried to replace.

And it is only with OpenMP that it gives different results; without OpenMP, the PGI compiler gives the same results as ifort and gfortran.

How different? Running in parallel may give different results due to the order of operations. Also, a slight change in one variable can have a cascading effect if used in a reduction variable.

Not to say that it couldn’t be a compiler issue, but more likely it’s a numerical accuracy issue in the code.
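A tiny illustration of the order-of-operations effect (nothing to do with your code, just floating-point non-associativity):

program fp_order
  implicit none
  real(8) :: a, b, c
  a = 1.0d16; b = -1.0d16; c = 1.0d0
  print *, (a + b) + c   ! prints 1.0
  print *, a + (b + c)   ! prints 0.0, because b + c rounds back to -1.0d16
end program fp_order

A parallel reduction groups the additions differently than the serial loop does, so small last-bit differences are expected and can grow if the result feeds later computations.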

  • Mat

So your suggestions removed the errors, but I have narrowed the remaining problem down and have reduced the code to this:

subroutine get_common(mesh, beta)
    type(triangle_mesh), intent(inout) :: mesh
    real(c_double), optional,       intent(in   ) :: beta

    ! locals
    integer(c_int) :: j, k, glb_face_index, left_face_index, rght_face_index, left, rght
    integer(c_int) :: elem2face_left(mesh%elem_num_faces), elem2face_rght(mesh%elem_num_faces)
    integer(c_int) :: left_indices(2), rght_indices(2)
    integer(c_int) :: face2elem(2)

    real(c_double) :: ibeta
    real(c_double) :: uflux_left(mesh%P1, mesh%Nvar), uflux_rght(mesh%P1, mesh%Nvar)
    real(c_double) :: uflux_tmp(mesh%P1, mesh%Nvar), uflux_avg(mesh%P1, mesh%Nvar)


    integer(c_int) :: Nfaces, elem_num_faces, P1, Nvar
    integer(c_int) :: devFace2elem(mesh%Nfaces, 2), devElem2face(mesh%Nelements, mesh%elem_num_faces)
    real(c_double) :: devUflux(mesh%Nflux, mesh%Nvar, mesh%Nelements), devUcommon(mesh%Nflux, mesh%Nvar, mesh%Nelements)
    real(c_double) :: devtemp(2)

    ibeta = HALF; if (present(beta)) ibeta = beta
    Nfaces = mesh%Nfaces
    devFace2elem = mesh%face2elem
    devUflux = mesh%uflux

    !$acc parallel loop copyin(devUflux)
    do glb_face_index = 1, Nfaces
      write(*,*) 'GPU', devUflux(2,2,2)
    end do
    !$acc end parallel loop 

    do glb_face_index = 1, Nfaces
      write(*,*) 'CPU', devUflux(2,2,2)
    end do
end subroutine get_common

The output is

 GPU   0.8888999999999868     
 GPU   0.8888999999999868     
 GPU   0.8888999999999868     
 GPU   0.8888999999999868     
 GPU   0.8888999999999868     
 GPU   0.8888999999999868     
 GPU   0.8888999999999868     
 GPU   0.8888999999999868     
 GPU   0.8888999999999868

 CPU   0.9999999999999868     
 CPU   0.9999999999999868     
 CPU   0.9999999999999868     
 CPU   0.9999999999999868     
 CPU   0.9999999999999868     
 CPU   0.9999999999999868     
 CPU   0.9999999999999868     
 CPU   0.9999999999999868     
 CPU   0.9999999999999868

And this is happening in general: for every value that is 0.9999-something, it seems to be replaced by 0.8888-something.

Hi vsingh96824,

This doesn’t make much sense, since all you’re doing is copying data in and printing it. If “devUflux” were in a higher-level data region, then you’d most likely just need to update the data, since “copyin” won’t copy data if it’s already present on the device. But this is a local variable.
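For reference, had it been in an enclosing data region, the fix would be an explicit update before the kernel, something like:

!$acc update device(devUflux)
!$acc parallel loop present(devUflux)
do glb_face_index = 1, Nfaces
  ! ... loop body ...
end do
!$acc end parallel loop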

Can you please create a reproducing example and either post it here or send it to PGI Customer Service (trs@pgroup.com)?

Thanks,
Mat

Hi Mat,

Sorry for the late reply.

I have posted the relevant testing code in

https://bitbucket.org/vsingh001/deepfry/src/4ed6ca140779?at=accTest

I am running

examples/CNS/advection

Please tell me if I can give more details, here or over email.

Thanks vsingh96824.

I was able to reproduce the error and believe it’s a numerical conversion error in the “print” routine on the GPU. The data on the GPU is correct, since the CPU and GPU sums are the same. It’s just the print that’s incorrect.

In particular, it appears to be incorrect for hex values ‘3FEFFFFFFFFDxxxx’, ‘3FEFFFFFFFFExxxx’, or ‘3FEFFFFFFFFFxxxx’, where ‘x’ can be any hex digit.

I have added a problem report, TPR#22238, and sent it to engineering for further investigation. Unfortunately, there isn’t a workaround other than to not print the values on the device and instead sum them to ensure the data is correct.

Best Regards,
Mat

Here’s a small reproducing example:

% cat foo.f90
program foo
  real(8) :: arr(16)
  arr(1)=Z'3FEFFFFFFFFD0000'
  arr(2)=Z'3FEFFFFFFFFE0000'
  arr(3)=Z'3FEFFFFFFFFF0000'
  arr(4)=Z'3FEFFFFFFFFF4100'
  arr(5)=Z'3FEFFFFFFFFF5A00'
  arr(6)=Z'3FEFFFFFFFFD6EE0'
  arr(7)=Z'3FEFFFFFFFFF7360'
  arr(8)=Z'3FEFFFFFFFFE88A1'
  arr(9)=Z'3FEFFFFFFFFF9999'
  arr(10)=Z'3FEFFFFFFFFDAAAA'
  arr(11)=Z'3FEFFFFFFFFDBCBC'
  arr(12)=Z'3FEFFFFFFFFFCC10'
  arr(13)=Z'3FEFFFFFFFFDDDDD'
  arr(14)=Z'3FEFFFFFFFFFEEEE'
  arr(15)=Z'3FEFFFFFFFFFF100'
  arr(16)=Z'3FEFFFFFFFFEFF00'
  call bar(arr,16)

end program foo

subroutine bar(arr,N)
  real(8) arr(*)
  integer N
  real(8) :: sum1, sum2

  sum1 = 0.0
  sum2 = 0.0
  do i=1,N
    print *, 'CPU ', arr(i)
    sum1=sum1+arr(i)
  enddo
  print *, "CPU SUM: ", sum1
  print *, "=================="
  !$acc parallel loop copyin(arr(1:N)) reduction(+:sum2)
  do i=1,N
    print *, 'GPU ', arr(i)
    sum2=sum2+arr(i)
  enddo
  print *, "GPU SUM: ", sum2

end subroutine bar
% pgf90 foo.f90 -acc -V15.10 -Minfo=accel ; a.out
bar:
     36, Generating copyin(arr(:n))
         Accelerator kernel generated
         Generating Tesla code
         36, Sum reduction generated for sum2
         37, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
 CPU    0.9999999999781721
 CPU    0.9999999999854481
 CPU    0.9999999999927240
 CPU    0.9999999999945715
 CPU    0.9999999999952820
 CPU    0.9999999999813234
 CPU    0.9999999999960032
 CPU    0.9999999999893313
 CPU    0.9999999999970896
 CPU    0.9999999999830227
 CPU    0.9999999999835363
 CPU    0.9999999999985238
 CPU    0.9999999999844779
 CPU    0.9999999999995148
 CPU    0.9999999999995737
 CPU    0.9999999999926956
 CPU SUM:     15.99999999985129
 ==================
 GPU    0.8999999999781721
 GPU    0.8999999999854481
 GPU    0.8999999999927240
 GPU    0.8999999999945715
 GPU    0.8999999999952820
 GPU    0.8999999999813234
 GPU    0.8999999999960032
 GPU    0.8999999999893313
 GPU    0.8999999999970896
 GPU    0.8999999999830227
 GPU    0.8999999999835363
 GPU    0.8899999999985238
 GPU    0.8999999999844779
 GPU    0.8899999999995148
 GPU    0.8899999999995737
 GPU    0.8999999999926956
 GPU SUM:     15.99999999985129

Wow, thanks Mat for such a quick resolution.

I will test based on what you mention.

Vikram

Hi Vikram,

We were able to create a patch file which fixes the problem. Please send a note to PGI Customer Service (trs@pgroup.com) and we can send you the updated file.

  • Mat

This was fixed in PGI 16.3 and above.