Routine call with Derived Type

I am trying to put a big loop onto the GPU.

It is something like this

do tmp_index = 1, Ninterior_faces
!do loop
!another do loop
call acc_evaluate_interaction_flux(DIM, P1, Nvar, &
              mesh%normals(glb_face_index, :))

end do

mesh is a derived type. At first I only parallelized the do loops. The subroutine call has many more arguments, but I narrowed the error down to the point where I pass the derived-type component.

Running this gives me the error

[gn012:11106] *** Process received signal ***
[gn012:11106] Signal: Segmentation fault (11)
[gn012:11106] Signal code:  (128)
[gn012:11106] Failing at address: (nil)
[gn012:11106] [ 0] /lib64/libpthread.so.0[0x316ee0f710]
[gn012:11106] [ 1] /usr/local/pgi-2016/linux86-64/16.5/lib/libpgc.so(__c_mcopy8+0x10e)[0x7fffe7bf7d8e]
[gn012:11106] *** End of error message ***

cuda-gdb tells me


Program received signal SIGSEGV, Segmentation fault.
0x00007fffe7bf7d8e in __c_mcopy8 () from /usr/local/pgi-2016/linux86-64/16.5/lib/libpgc.so

Now if instead I do

normals = mesh%normals(glb_face_index, :)
call acc_evaluate_interaction_flux(DIM, P1, Nvar, &
              normals)

the error goes away. So is this a known shortcoming, or am I doing something wrong?

Also, there’s another problem, if I may ask it here.

The subroutine goes something like this.

    subroutine acc_evaluate_interaction_flux(DIM, p1, nvar, &
      nminus, pnminus, &
      uminus, uplus,  &
      Fdminus, Fdplus, Fvminus, Fvplus, &
      Gdminus, Gdplus, Gvminus, Gvplus, &
      Hdminus, Hdplus, Hvminus, Hvplus, &
      Fi, Gi, Hi, &
      Fv, Gv, Hv, &
      gamma, viscous_prefactor, face_type, &
      ilambda, ibeta_viscous, itau, &
      glb_face_index)

      !$acc routine 
      use input_module, only: ldg_tau, ldg_beta

!declarations

     lambda = HALF;           if (present(ilambda)) lambda = ilambda
     beta_viscous = ldg_beta; if (present(ibeta_viscous)) beta_viscous = ibeta_viscous
     tau_penalty = ldg_tau;   if (present(itau)) tau_penalty = itau

!more loops

Now, ldg_tau and ldg_beta are defined in another module, so I was getting acc create errors. I went to the other module and did

  real(c_double),     save, public :: vdiff      = ZERO
  real(c_double),     save, public :: ldg_beta   = HALF
  real(c_double),     save, public :: ldg_tau    = TENTH


!$acc declare create(ldg_tau, ldg_beta, vdiff)

Unfortunately as soon as I do this, the code immediately exits, giving me

call to cudaGetSymbolAddress returned error 13: Other

Hi Vsingh,

The seg fault is occurring on the host, where it’s trying to dereference a null pointer. Why this is occurring is unclear.

Is acc_evaluate_interaction_flux being called from an OpenACC compute region or from the host code?

Note that Fortran optional arguments are not yet supported with OpenACC routines. Are you using optionals?

  • Mat

Hi Mat,

Thanks for the reply. Yes, it is called from inside an OpenACC loop.

!$acc parallel loop
do tmp_index = 1, Ninterior_faces 
!do loop 
!another do loop 
call acc_evaluate_interaction_flux(DIM, P1, Nvar, &
              mesh%normals(glb_face_index, :)) 

end do 
!$acc end parallel

No, normals is not optional.

Can you please also comment on the second part, regarding acc declare.

Does acc_evaluate_interaction_flux have optionals? If not, then why the “present” checks, and why the discrepancy in the number of arguments (it is called with 4, but the signature has 32)?

Can you please also comment on the second part, regarding acc declare.

I’m not sure why this is occurring, but most likely without the “declare create” the compiler did not generate the device code. By adding it, the compiler is getting further.

Now the problem could be the use of optionals. Another possibility could be that you’re compiling with the “-ta=tesla:nordc” option. Without RDC, you can’t have global symbols nor make device routine calls.
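For reference, with the PGI compilers of this era RDC is the default for the tesla target. A hedged sketch of the two compile lines (file names illustrative, not the project’s actual make rule):

    pgfortran -acc -ta=tesla -Minfo=accel system.f90 flux.f90 -o main   # RDC on (default): global symbols and device routine calls work
    pgfortran -acc -ta=tesla:nordc system.f90 flux.f90 -o main          # nordc: "declare create" symbols and routine calls would fail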

Having a reproducible example would be very helpful.

  • Mat

I reduced the number of arguments down to the one that was causing the error. That is why it’s only 4.

I will try to generate a reproducing example and post it.

In the second case, it’s not a compiler error. The code compiles, but on running it exits almost immediately.

And I am not compiling with nordc.

Hi Mat,

I have put the code in

https://bitbucket.org/vsingh001/deepfry/src/a8e179eed183?at=gTest

You have to go to the folder

deepfry/examples/CNS/tgv

and do

make COMP=PGI ACC=t DIM3=t

and then do

./main.Linux.PGI.debug.acc.atlas.3d.exe inputs_3d

The relevant files that these errors are emanating from are in

src/cns/system.f90
src/cns/flux.f90
src/common/input.f90

If you comment out lines 1139, 1140, 1141 in system.f90 you can see the first error.

Hi Vsingh,

The “cudaGetSymbolAddress” error is due to your use of “optional” in the device routine. Removing the “optional” attribute and the “present” tests eliminates the problem.
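A minimal sketch of that change (names taken from the thread; moving the defaulting to the call sites is my assumption about how it would be restructured):

    ! before: optional arguments are not supported in an OpenACC routine
    !   real(c_double), optional, intent(in) :: ilambda
    !   lambda = HALF; if (present(ilambda)) lambda = ilambda

    ! after: make the argument mandatory and have callers pass the default
    real(c_double), intent(in) :: ilambda
    lambda = ilambda   ! callers that previously omitted it now pass HALF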

The seg fault is a problem with the temporary array descriptor that gets created in order to pass the subarray to the device routine. The work-around is to pass in the whole array and then use “glb_face_index” to index the first dimension.
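A sketch of that work-around, assuming the interface is adjusted to take the index (argument names are illustrative):

    ! call site: pass the full component, not a slice
    call acc_evaluate_interaction_flux(DIM, P1, Nvar, &
                  mesh%normals, glb_face_index)

    ! inside the routine: index the first dimension directly
    subroutine acc_evaluate_interaction_flux(DIM, p1, nvar, normals, glb_face_index)
      !$acc routine
      real(c_double), intent(in) :: normals(:, :)
      integer,        intent(in) :: glb_face_index
      ! ... use normals(glb_face_index, :) wherever the slice was used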

After this, I get an illegal address on the device. This is being caused by how you’re managing “mesh”. In several spots you create and then delete “mesh” and some of its arrays. However, you missed deleting it in one spot, thus causing an inconsistency. My suggestion would be to create “mesh” and each of the member arrays once, at the same time that you allocate the arrays. Then use the “update” directive to synchronize the data. This will make things easier as you port more of the code, and it reduces the number of directives you need to add.
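One way to structure that, sketched with names and sizes assumed from the thread:

    ! once, right after the host allocation:
    allocate(mesh%normals(Nfaces, DIM))
    !$acc enter data copyin(mesh)
    !$acc enter data create(mesh%normals)

    ! whenever the host modifies the data, synchronize instead of re-creating:
    !$acc update device(mesh%normals)

    ! once, at shutdown (members before the parent):
    !$acc exit data delete(mesh%normals)
    !$acc exit data delete(mesh)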

Note that your device routine contains several small automatic arrays. Automatic arrays require every thread to allocate device memory, which can be very slow. If you can, try to make these fixed-size so they don’t need to be allocated and deallocated every time the device routine is called.
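A hedged sketch of that change, assuming a compile-time upper bound (MAX_NVAR here is a hypothetical constant, not from the code):

    ! automatic: sized by an argument, so every thread allocates on entry
    ! real(c_double) :: work(nvar)

    ! fixed size: no per-call device allocation; only the first nvar entries are used
    integer, parameter :: MAX_NVAR = 8   ! assumed upper bound
    real(c_double)     :: work(MAX_NVAR)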

I’ll email you the changes I made.

  • Mat

Hi Mat,

Sorry for the late reply.

Thanks so much for the extensive help and guidelines.

I received the mail as well. What I did was replace the 3 files into the code.

Unfortunately I get the following error.

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

I tried to apply the changes line by line, and I found that the error starts occurring in flux.f90 at the lines

      My_minus  = uminus(:,3)
      My_plus   = uplus(:, 3)

Yes, the mesh handling is a mess. I am converting subroutines one by one, which often leads to mistakes. Will try to be more careful. Thanks.

I am still not very comfortable with acc routine. How are variables being created when I am not doing any “acc enter data create”, etc.?

Unfortunately, in the case of the automatic arrays, their sizes are decided by the input file and can vary widely. Will dynamic arrays sort out the problem?

Thanks again.

Unfortunately, in the case of the automatic arrays, their sizes are decided by the input file and can vary widely. Will dynamic arrays sort out the problem?

Do you mean changing your automatics to allocatable arrays in your OpenACC “routines”? No, this won’t help. Automatics are dynamic; they’re just implicitly allocated. Allocatables would just make the allocation explicit.

  • Mat

Ok.

Can you please also comment about this new error.

Thanks.

So, now I am seeing two problems.

I changed the routine directive to

    !$acc routine vector

and for some reason the error went away, but I have another issue.

We also have access to an older M2090, and the code was refusing to compile for it. I was getting the error

PGF90-S-1001-All selected compute capabilities were disabled

I read somewhere that reductions don’t work within routine directives. I am guessing this is an older-hardware issue, since it compiles for the newer Kepler GPUs.

So can you please help on both issues.

Thanks.

I read somewhere that reductions don’t work within routine directives. I am guessing this is an older-hardware issue, since it compiles for the newer Kepler GPUs.

Correct. Reductions in OpenACC “vector” and “worker” loops within a device routine are only supported on devices with compute capability 3.0 or newer. Reductions in this scenario require a shuffle instruction, which isn’t available on older devices.
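For illustration, this is the kind of pattern the restriction applies to, a vector reduction inside a device routine (a self-contained sketch, not code from the thread):

    subroutine acc_vector_sum(n, x, s)
      !$acc routine vector
      integer, intent(in)  :: n
      real(8), intent(in)  :: x(n)
      real(8), intent(out) :: s
      integer :: i
      s = 0.0d0
      !$acc loop vector reduction(+:s)   ! lowered to shuffle instructions, hence CC 3.0+
      do i = 1, n
        s = s + x(i)
      end do
    end subroutine acc_vector_sum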

  • Mat

Thanks.

I decided to change the whole subroutine, so that it can work for both types of hardware.

Thanks for the previous suggestions as well.