Routine call with Derived Type

I am trying to put a big loop onto the GPU.

It is something like this

do tmp_index = 1, Ninterior_faces
!do loop
!another do loop
call acc_evaluate_interaction_flux(DIM, P1, Nvar, &
              mesh%normals(glb_face_index, :))

end do

mesh is a derived type. At first I only parallelized the do loops. The subroutine call has many more arguments, but I narrowed the error down to the point where I pass the derived-type component.

Running this gives me the error

[gn012:11106] *** Process received signal ***
[gn012:11106] Signal: Segmentation fault (11)
[gn012:11106] Signal code:  (128)
[gn012:11106] Failing at address: (nil)
[gn012:11106] [ 0] /lib64/libpthread.so.0[0x316ee0f710]
[gn012:11106] [ 1] /usr/local/pgi-2016/linux86-64/16.5/lib/libpgc.so(__c_mcopy8+0x10e)[0x7fffe7bf7d8e]
[gn012:11106] *** End of error message ***

cuda-gdb tells me


Program received signal SIGSEGV, Segmentation fault.
0x00007fffe7bf7d8e in __c_mcopy8 () from /usr/local/pgi-2016/linux86-64/16.5/lib/libpgc.so

Now if instead I do

normals = mesh%normals(glb_face_index, :)
call acc_evaluate_interaction_flux(DIM, P1, Nvar, &
              normals)

the error goes away. So is this a known shortcoming, or am I doing something wrong?

Also, there’s another problem, if I may ask it here.

The subroutine goes something like this.

    subroutine acc_evaluate_interaction_flux(DIM, p1, nvar, &
      nminus, pnminus, &
      uminus, uplus,  &
      Fdminus, Fdplus, Fvminus, Fvplus, &
      Gdminus, Gdplus, Gvminus, Gvplus, &
      Hdminus, Hdplus, Hvminus, Hvplus, &
      Fi, Gi, Hi, &
      Fv, Gv, Hv, &
      gamma, viscous_prefactor, face_type, &
      ilambda, ibeta_viscous, itau, &
      glb_face_index)

      !$acc routine 
      use input_module, only: ldg_tau, ldg_beta

!declarations

     lambda = HALF;           if (present(ilambda)) lambda = ilambda
     beta_viscous = ldg_beta; if (present(ibeta_viscous)) beta_viscous = ibeta_viscous
     tau_penalty = ldg_tau;   if (present(itau)) tau_penalty = itau

!more loops

Now, ldg_tau and ldg_beta are defined in another module, so I was getting acc create errors. I went to the other module and did

  real(c_double),     save, public :: vdiff      = ZERO
  real(c_double),     save, public :: ldg_beta   = HALF
  real(c_double),     save, public :: ldg_tau    = TENTH


!$acc declare create(ldg_tau, ldg_beta, vdiff)

Unfortunately as soon as I do this, the code immediately exits, giving me

call to cudaGetSymbolAddress returned error 13: Other

Hi Vsingh,

The seg fault is occurring on the host, where it’s trying to dereference a null pointer. Why this is occurring is unclear.

Is acc_evaluate_interaction_flux being called from an OpenACC compute region or from the host code?

Note that Fortran optional arguments are not yet supported with OpenACC routines. Are you using optionals?

  • Mat

Hi Mat,

Thanks for the reply. Yes, it is called from inside an OpenACC loop.

!$acc parallel loop
do tmp_index = 1, Ninterior_faces 
!do loop 
!another do loop 
call acc_evaluate_interaction_flux(DIM, P1, Nvar, &
              mesh%normals(glb_face_index, :)) 

end do 
!$acc end parallel

No, normals is not optional.

Can you please also comment on the second part, regarding acc declare.

Does acc_evaluate_interaction_flux have optionals? If not, then why the “present” checks, and why the discrepancy in the number of arguments (it is called with 4, but the signature has 32)?

Can you please also comment on the second part, regarding acc declare.

I’m not sure why this is occurring, but most likely without the “declare create” the compiler did not generate the device code. By adding it, the compiler is getting further.

Now the problem could be the use of optionals. Another possibility could be that you’re compiling with the “-ta=tesla:nordc” option. Without RDC, you can’t have global symbols nor make device routine calls.
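For reference, with the PGI compilers of this era RDC is the default for the tesla target. A hedged sketch of the two compile lines (file names illustrative, not the project’s actual make rule):

    pgfortran -acc -ta=tesla -Minfo=accel system.f90 flux.f90 -o main   # RDC on (default): global symbols and device routine calls work
    pgfortran -acc -ta=tesla:nordc system.f90 flux.f90 -o main          # nordc: "declare create" symbols and routine calls would fail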

Having a reproducible example would be very helpful.

  • Mat

I reduced the number of arguments down to the one that was causing the error. That is why it’s only 4.

I will try to generate a reproducing example and post it.

In the second case, it’s not a compiler error. The code compiles, but on running it exits almost immediately.

And I am not compiling with nordc.

Hi Mat,

I have put the code in

https://bitbucket.org/vsingh001/deepfry/src/a8e179eed183?at=gTest

You have to go to the folder

deepfry/examples/CNS/tgv

and do

make COMP=PGI ACC=t DIM3=t

and then do

./main.Linux.PGI.debug.acc.atlas.3d.exe inputs_3d

The relevant files that these errors are emanating from are in

src/cns/system.f90
src/cns/flux.f90
src/common/input.f90

If you comment out lines 1139, 1140, 1141 in system.f90 you can see the first error.

Hi Vsingh,

The “cudaGetSymbolAddress” error is due to your use of “optional” in the device routine. Removing the “optional” attribute and the “present” tests eliminates the problem.
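A minimal sketch of that change (names taken from the thread; moving the defaulting to the call sites is my assumption about how it would be restructured):

    ! before: optional arguments are not supported in an OpenACC routine
    !   real(c_double), optional, intent(in) :: ilambda
    !   lambda = HALF; if (present(ilambda)) lambda = ilambda

    ! after: make the argument mandatory and have callers pass the default
    real(c_double), intent(in) :: ilambda
    lambda = ilambda   ! callers that previously omitted it now pass HALF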

The seg fault is a problem with the temporary array descriptor that gets created in order to pass the subarray to the device routine. The work-around is to pass in the whole array and then use “glb_face_index” to index the first dimension.
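A sketch of that work-around, assuming the interface is adjusted to take the index (argument names are illustrative):

    ! call site: pass the full component, not a slice
    call acc_evaluate_interaction_flux(DIM, P1, Nvar, &
                  mesh%normals, glb_face_index)

    ! inside the routine: index the first dimension directly
    subroutine acc_evaluate_interaction_flux(DIM, p1, nvar, normals, glb_face_index)
      !$acc routine
      real(c_double), intent(in) :: normals(:, :)
      integer,        intent(in) :: glb_face_index
      ! ... use normals(glb_face_index, :) wherever the slice was used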

After this, I get an illegal address on the device. This is being caused by how you’re managing “mesh”. In several spots you create and then delete “mesh” and some of its arrays. However, you missed deleting it in one spot, thus causing an inconsistency. My suggestion would be to create “mesh” and each of the member arrays once, at the same time that you allocate the arrays. Then use the “update” directive to synchronize the data. This will make things easier as you port more of the code, and it reduces the number of directives you need to add.
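One way to structure that, sketched with names and sizes assumed from the thread:

    ! once, right after the host allocation:
    allocate(mesh%normals(Nfaces, DIM))
    !$acc enter data copyin(mesh)
    !$acc enter data create(mesh%normals)

    ! whenever the host modifies the data, synchronize instead of re-creating:
    !$acc update device(mesh%normals)

    ! once, at shutdown (members before the parent):
    !$acc exit data delete(mesh%normals)
    !$acc exit data delete(mesh)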

Note that your device routine contains several small automatic arrays. Automatic arrays require every thread to allocate device memory, which can be very slow. If you can, try to make these fixed-size so they don’t need to be allocated and deallocated every time the device routine is called.
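A hedged sketch of that change, assuming a compile-time upper bound (MAX_NVAR here is a hypothetical constant, not from the code):

    ! automatic: sized by an argument, so every thread allocates on entry
    ! real(c_double) :: work(nvar)

    ! fixed size: no per-call device allocation; only the first nvar entries are used
    integer, parameter :: MAX_NVAR = 8   ! assumed upper bound
    real(c_double)     :: work(MAX_NVAR)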

I’ll email you the changes I made.

  • Mat

Hi Mat,

Sorry for the late reply.

Thanks so much for the extensive help and guidelines.

I received the mail as well. What I did was replace the 3 files into the code.

Unfortunately I get the following error.

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

I tried to apply the changes line by line, and I found that the error starts occurring in flux.f90 at the lines

      My_minus  = uminus(:,3)
      My_plus   = uplus(:, 3)

Yes, the mesh handling is a mess. I am converting subroutines one by one, which often leads to mistakes. Will try to be more careful. Thanks.

I am still not very comfortable with acc routine. How are variables being created when I am not doing any “acc enter data create”, etc.?

Unfortunately, in the case of the automatic arrays, their sizes are decided by the input file and can vary widely. Will dynamic arrays sort out the problem?

Thanks again.

Unfortunately, in the case of the automatic arrays, their sizes are decided by the input file and can vary widely. Will dynamic arrays sort out the problem?

Do you mean changing your automatics to allocatable arrays in your OpenACC “routines”? No, this won’t help. Automatics are dynamic; they’re just implicitly allocated. Allocatables would just make the allocation explicit.

  • Mat

Ok.

Can you please also comment about this new error.

Thanks.

So, now I am seeing two problems.

I changed the routine directive to

    !$acc routine vector

and for some reason the error went away, but I have another issue.

We also have access to an older M2090, and the code was refusing to compile for it. I was getting the error

PGF90-S-1001-All selected compute capabilities were disabled

I read somewhere that reductions don’t work within routine directives. I am guessing this is an older-hardware issue, since it compiles for the newer Kepler GPUs.

So can you please help on both issues.

Thanks.

I read somewhere that reductions don’t work within routine directives. I am guessing this is an older-hardware issue, since it compiles for the newer Kepler GPUs.

Correct. Reductions in OpenACC “vector” and “worker” loops within a device routine are only supported on devices with compute capability 3.0 or newer. Reductions in this scenario require a shuffle instruction, which isn’t available on older devices.
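For illustration, this is the kind of pattern the restriction applies to, a vector reduction inside a device routine (a self-contained sketch, not code from the thread):

    subroutine acc_vector_sum(n, x, s)
      !$acc routine vector
      integer, intent(in)  :: n
      real(8), intent(in)  :: x(n)
      real(8), intent(out) :: s
      integer :: i
      s = 0.0d0
      !$acc loop vector reduction(+:s)   ! lowered to shuffle instructions, hence CC 3.0+
      do i = 1, n
        s = s + x(i)
      end do
    end subroutine acc_vector_sum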

  • Mat

Thanks.

I decided to change the whole subroutine, so that it can work for both types of hardware.

Thanks for the previous suggestions as well.