OpenACC Fortran module variables

Dear all,
I’m having issues handling module variables in an OpenACC Fortran program. I can’t get correct results from code that uses allocatable arrays from a module (the CPU code runs just fine).

As a last resort I tried to implement the example shown here, which behaves just like my code (CPU OK, GPU doesn’t work).

command line CPU:
nvfortran -Mpreprocess -Minfo=all static.f90 alloc.f90 main.f90 -o cpu_code
output CPU (correct) :
1.000000 1.000000
3.000000 3.000000

command line GPU:
nvfortran -acc -Mpreprocess -Minfo=all -gpu=lineinfo,managed static.f90 alloc.f90 main.f90 -o gpu_code
output GPU (wrong!) :
1.000000 1.000000
1.000000 1.000000

Source code:

FILE main.f90 :

program computer

    use staticmod
    implicit none 

    integer :: n
    integer :: i
    real iprocess
    
    n=2
    call allocit(n)

    xstat=1
    yalloc=1

    write(*,*) yalloc

    !$acc data copyin(yalloc,xstat)
    !$acc parallel loop
    do i = 1, n
        yalloc(i) = iprocess( i )
    enddo
    !$acc update self(yalloc)
    !$acc end data

    write(*,*) yalloc

end program    

real function iprocess( i )

        use staticmod

        implicit none 

        !$acc routine seq
        integer :: i

        iprocess = yalloc(i) + 2.*xstat(i)

end function

FILE static.f90 :

module staticmod
    implicit none 
    integer, parameter :: maxl = 100000
    real, dimension(maxl) :: xstat
    real, dimension(:), allocatable :: yalloc
    !$acc declare create(xstat,yalloc)
end module

FILE alloc.f90 :

subroutine allocit(n)

    use staticmod
    
    implicit none 

    integer :: n

    allocate( yalloc(n) )

end subroutine

thank you for your help!

Hi pietro,

A “declare” directive creates a data region whose scope and lifetime match the scoping unit in which it’s declared. In this case, that is global scope (via the module), with a lifetime equal to the runtime of the program.

When you nest data regions as you do here, the “copy” clauses use “present_or” semantics: if the data is already present, no copy occurs. Since “xstat” is already present, its values are not updated on the device.

To fix, remove the nested data region and use the “update” directive instead:

    !$acc update device(xstat)
    !$acc parallel loop
    do i = 1, n
        yalloc(i) = iprocess( i )
    enddo
    !$acc update self(yalloc)

Note that you are using the “-gpu=managed” flag, so CUDA Unified Memory (UM) will be used. However, UM is only available for dynamically allocated data, so only “yalloc” is managed. “xstat” is a fixed-size static array, so it still needs to be managed via data regions.
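
The “present_or” behavior described above can be seen in isolation with a small sketch. The names demo_mod, xs, res and the size m are hypothetical and not part of the original example. It assumes compilation with something like "nvfortran -acc" and no managed memory; built without -acc, the directives are ignored and both lines print the host values.

```fortran
module demo_mod
    implicit none
    integer, parameter :: m = 4          ! hypothetical size
    real, dimension(m) :: xs
    !$acc declare create(xs)             ! xs is present on the device for the whole run
end module

program present_or_demo
    use demo_mod
    implicit none
    integer :: i
    real, dimension(m) :: res

    xs = 1.0                             ! sets the host copy only

    ! xs is already present because of "declare create", so copyin
    ! does not transfer anything: the device copy keeps its old contents.
    !$acc data copyin(xs) copyout(res)
    !$acc parallel loop
    do i = 1, m
        res(i) = xs(i)
    enddo
    !$acc end data
    write(*,*) 'after copyin:', res      ! device contents of xs, not necessarily 1.0

    ! An update directive always performs the host-to-device transfer.
    !$acc update device(xs)
    !$acc data copyout(res)
    !$acc parallel loop
    do i = 1, m
        res(i) = xs(i)
    enddo
    !$acc end data
    write(*,*) 'after update:', res      ! now reflects the host values
end program
```

The first line printed depends on whatever the device copy of xs held before, which is why no value is promised for it here.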

Hope this helps,
Mat

Thank you so much Mat for your immediate support.

However, the situation is still not entirely clear (to me).

If I run your code (i.e. remove the data region and include an update device only for “xstat”) it works only if I compile using the -gpu=managed flag.

  1. Does that mean that using CUDA UM always keeps any allocatable array synchronized between host and device? That would be undesirable, at least for my applications.

If I update both arrays ( !$acc update device(xstat, yalloc) ), whether or not I enclose the update in a data region, and no matter what compilation flag I use, I always get the correct results.

  2. Finally, can you tell me whether !$acc declare create is the correct, only, and most optimized way to handle module variables (either static or allocatable) in OpenACC Fortran?

So, I may draw some (temporary) conclusions:

  • the example in the Nvidia Guide is still wrong.
  • my code was wrong because I thought that copyin would force an update if the variable is present. (Shame on me, I should have read the OpenACC reference guide more carefully: “if the data in list is already present on the current device, the appropriate reference count is incremented and that copy is used.”)

thank you!

If I run your code (i.e. remove the data region and include an update device only for “xstat”) it works only if I compile using the -gpu=managed flag.

Correct, because the value of “yalloc” is read on the device and therefore needs to be updated on the device as well before entering the data region. With “managed”, it’s implicitly updated on the device. But it looks like you figured that out already.
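
For reference, a single-file sketch of the corrected example that should not depend on “-gpu=managed”, since both arrays are pushed to the device explicitly. It condenses the three files from the first post. As an assumption made for this sketch only, iprocess is moved into the module so that its “routine” information is visible to the caller, and the allocation is done inline instead of through allocit.

```fortran
module staticmod
    implicit none
    integer, parameter :: maxl = 100000
    real, dimension(maxl) :: xstat
    real, dimension(:), allocatable :: yalloc
    !$acc declare create(xstat, yalloc)
contains
    ! Assumption: kept as a module procedure here, not an external function.
    real function iprocess(i)
        !$acc routine seq
        integer, intent(in) :: i
        iprocess = yalloc(i) + 2.*xstat(i)
    end function
end module

program computer
    use staticmod
    implicit none
    integer :: n, i

    n = 2
    allocate(yalloc(n))      ! the device copy is allocated too, via declare create

    xstat = 1
    yalloc = 1
    write(*,*) yalloc

    ! Both arrays are read on the device, so both are updated explicitly.
    !$acc update device(xstat, yalloc)
    !$acc parallel loop
    do i = 1, n
        yalloc(i) = iprocess(i)
    enddo
    !$acc update self(yalloc)

    write(*,*) yalloc        ! expected to match the CPU result (3.0 for each element)
end program
```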

  • the example in the Nvidia Guide is still wrong.

It’s not complete, in that the data movement would be handled elsewhere, but the example code snippet correctly illustrates the concept that device data declared in one module can be accessed from another module.

my code was wrong because I thought that copyin would force an update if the variable is present.

“present_or” semantics do trip up new users, so don’t feel bad. At one point I argued for a “present_and” version where the copy would be done if the variable was present, but the language committee didn’t like it since it could cause extra unintended data movement.

Personally, I no longer use structured data regions (except in a few cases). Instead, I opt for unstructured data regions (enter/exit) close to where the arrays are declared, or just after they are allocated. I only use “declare” when a module variable is accessed directly from within a device subroutine. I then use “update” directives to manage the data movement and add “default(present)” to my compute regions, so if I forget to add a variable to a data region, the program aborts at run time and I can fix the error.
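
A minimal sketch of that workflow, assuming two hypothetical local arrays a and b of hypothetical size n (none of these names come from the thread): device memory is created right after the host allocation, data moves only through “update”, and the compute region uses “default(present)”.

```fortran
program unstructured_demo
    implicit none
    integer, parameter :: n = 8            ! hypothetical size
    real, dimension(:), allocatable :: a, b
    integer :: i

    allocate(a(n), b(n))
    !$acc enter data create(a, b)          ! device allocation right after host allocation

    a = 1.0
    !$acc update device(a)                 ! explicit host-to-device copy

    ! default(present) makes a forgotten array a run-time error
    ! instead of an implicit copy.
    !$acc parallel loop default(present)
    do i = 1, n
        b(i) = 2.0 * a(i)
    enddo

    !$acc update self(b)                   ! explicit device-to-host copy
    write(*,*) b                           ! every element should be 2.0

    !$acc exit data delete(a, b)           ! release device memory before deallocating
    deallocate(a, b)
end program
```

If b were left out of the “enter data” directive, the “default(present)” clause is what would make the run stop with a “not present” error rather than silently copying the array.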

Thanks for the accurate explanation.

For some reason that I still can’t figure out, I am not able to replicate the correct behavior of the code snippet in the code I’m currently working on… The only way I managed to get it working is by setting the sizes of the module arrays as parameters… (did I just turn it into a static array?). I think the problem can be solved by using the enter/exit data declaration instead of the usual data structure, but unfortunately it is very poorly documented. Can you suggest any link where I can find examples?

EDIT: I found some examples, but the enter/exit data construct looks to me like a data/end data block. The only difference is that the enter and exit clauses can be placed in different files/procedures. am I right?

thanks

I’m assuming you’re still using the example above with the corrections? Or is the issue with a different example? The above example works fine for me. Please post if you have a different example.

The only way I managed to get it working is by setting the sizes of the module arrays as parameters… (did I just turn it into a static array?)

This doesn’t quite make sense. The difference here is that when using a fixed size, static array, the device memory is allocated at the same time the device is initialized. With allocatable arrays, only the array descriptor is created on device initialization, with the device data allocated at the same time that it’s allocated on the host.

But this wouldn’t affect the scoping of the device data or when/how to synchronize the device and host memories.

I think the problem can be solved by using the enter/exit data declaration instead of the usual data structure,

A “declare” directive and an unstructured data region are basically the same semantically, with only the scoping and lifetime of the device data being different. So while you certainly can use the “enter/exit” data directives, it shouldn’t matter.

Then again, I’m not sure what the core issue is given the original example with the edits works fine for me.

The only difference is that the enter and exit clauses can be placed in different files/procedures. am I right?

Yes. A structured region must have a defined start and stopping place within the same scoping unit, while unstructured data regions (enter/exit) can be in separate scoping units.
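
As an illustration of that last point, here is a sketch in which the “enter data” and “exit data” directives live in different procedures. The module field_mod, its array field, and the three subroutines are hypothetical names chosen for this example only.

```fortran
module field_mod
    implicit none
    real, dimension(:), allocatable :: field    ! hypothetical module array
contains
    subroutine field_init(n)
        integer, intent(in) :: n
        allocate(field(n))
        field = 0.0
        !$acc enter data copyin(field)          ! the data region starts here ...
    end subroutine

    subroutine field_step()
        integer :: i
        !$acc parallel loop default(present)
        do i = 1, size(field)
            field(i) = field(i) + 1.0
        enddo
    end subroutine

    subroutine field_finalize()
        !$acc exit data copyout(field)          ! ... and ends here, in another procedure
    end subroutine
end module

program enter_exit_demo
    use field_mod
    implicit none

    call field_init(4)
    call field_step()
    call field_finalize()
    write(*,*) field                            ! every element should be 1.0
    deallocate(field)
end program
```

A structured “data”/“end data” pair could not be split across field_init and field_finalize in this way, because both ends must sit in the same scoping unit.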