fortran module variable

I’ve been trying to get OpenACC working on an f2py Fortran extension that I want to call from Python. The extension compiles and runs correctly without -acc. Everything compiles OK with -acc (-Minfo shows parallelized loops etc.), but when I call the extension I get:

test.py Failing in Thread:1
call to cuModuleGetGlobal returned error 400: Invalid handle

After much effort I think I’ve isolated the problem, and I don’t think it is in f2py - it is either in PGI or in my understanding!

The following module (call it version 1) can be run by itself and gives correct results, and can also be called from python:

module useit
	real, dimension(:), allocatable :: y
contains

    subroutine doit()
        integer :: i
        allocate(y(4))
        y = 1
        !$acc kernels loop
        do i = 1, 4
            y(i) = y(i) + 1
        enddo
        print *, y
    end subroutine doit

end module useit

program testit
    use useit
    call doit()
end program testit

This code (call it version 2) is closer to the structure of my code and compiles fine, but gives incorrect results (returning {1,1,1,1} when it should be {2,2,2,2}) and throws the cuModuleGetGlobal error when I try to call it from Python:

module useit

    real, dimension(:), allocatable :: y
    !$acc declare create(y)
    
contains

    subroutine doit()
        integer :: i
        allocate(y(4))
        y = 1
        !$acc kernels loop
        do i = 1, 4
            call addit(i)
        enddo
        print *, y
    end subroutine doit

    subroutine addit(i)
        !$acc routine
        integer :: i
        y(i) = y(i) + 1
    end subroutine addit

end module useit

program testit
    use useit
    call doit()
end program testit

The problem seems to be with the !$acc declare create(y) statement, since adding that line to version 1 causes it to fail the same way as version 2:

module useit
    real, dimension(:), allocatable :: y
    !$acc declare create(y)
contains

    subroutine doit()
        integer :: i
        allocate(y(4))
        y = 1
        !$acc kernels loop
        do i = 1, 4
            y(i) = y(i) + 1
        enddo
        print *, y
    end subroutine doit

end module useit

program testit
    use useit
    call doit()
end program testit

Is there something wrong with the way I am using the declare statement?

Hi Ciaran Harman,

There are two different issues here.

Is there something wrong with the way I am using the declare statement?

Yes, in that by adding “y” to a declare directive, the compiler will no longer implicitly copy it to and from the device. Hence you need to add update directives in order to synchronize the host and device copies.

subroutine doit()
    integer :: i
    allocate(y(4))
    y = 1
    !$acc update device(y)
    !$acc kernels loop
    do i = 1, 4
        call addit(i)
    enddo
    !$acc update self(y)
    print *, y
end subroutine doit
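
As an illustration of why the update directives matter, here is a toy Python model of a variable that has separate host and device copies. It is only a sketch: it is not OpenACC and does not reflect how the PGI runtime is implemented, and the class and function names (`DeclaredArray`, `run`) are hypothetical.

```python
class DeclaredArray:
    """Toy model of an array named in `!$acc declare create`.

    Host and device hold separate copies; only the explicit update
    methods move data between them.
    """

    def __init__(self, n, fill):
        self.host = [fill] * n
        # Real device memory would be uninitialized here; 0.0 is a placeholder.
        self.device = [0.0] * n

    def update_device(self):
        # Stands in for: !$acc update device(y)
        self.device = list(self.host)

    def update_self(self):
        # Stands in for: !$acc update self(y)
        self.host = list(self.device)

    def kernel_add_one(self):
        # Stands in for the kernels loop: it touches the device copy only.
        self.device = [v + 1 for v in self.device]


def run(with_updates):
    """Return the host values that `print *, y` would show in this model."""
    y = DeclaredArray(4, 1.0)
    if with_updates:
        y.update_device()
    y.kernel_add_one()
    if with_updates:
        y.update_self()
    return y.host
```

In this model `run(False)` leaves the host copy at its initial values, matching the {1,1,1,1} output reported for version 2, while `run(True)` returns the incremented values.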

For the runtime error when calling from Python, I’m assuming you’ve built a shared object which gets called by your Python program? If so, try adding the flag “-ta=tesla:nordc” to the compilation. This disables relocatable device code (RDC) generation, which requires a link step, though it does mean that you can no longer make calls to device routines found in other source files nor use device modules from external modules.

Also, which PGI compiler version are you using and how are you building the shared object? Later PGI versions do have some limited support for RDC within a shared object provided you’re using the PGI compiler to create the shared object (via the -shared option) and with OpenACC enabled.

-Mat

Thanks for the quick reply. Including the ‘update’ lines did help: compiling the code into a stand-alone (using pgfortran -o testit testit.f90 -acc) now produces the correct result. However that alone doesn’t fix the runtime error when calling from Python.

Adding “-ta=tesla:nordc” results in two different runtime errors occurring:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

If it helps, I am building the extension using distutils. My setup.py file looks like this:

from distutils import util

import numpy
from numpy.distutils.core import Extension
from numpy.distutils.core import setup

config = {
    'author': 'me, myself',
    ...
    stuff
    ...
    'ext_modules': [Extension(name='solve', sources=[util.convert_path('./modulename/submodulename/extensionname.f90')],
                              include_dirs=[numpy.get_include()],
                              extra_f90_compile_args=["-fast", '-acc', '-Minfo', '-ta=tesla:nordc'],
                              extra_link_args=['-acc'],
                              libraries=None)],
}

setup(**config, requires=['pandas', 'numpy', 'scipy', 'matplotlib'])

The ‘extra_link_args’ are used in the linking step (pgfortran -shared -fpic -acc …), and Python throws an ImportError if -acc isn’t included.
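
To make it easier to switch between builds with and without RDC, the flag lists could be generated by a small helper. This is a minimal sketch: the function name `pgi_openacc_args` and its parameters are hypothetical, and the flags are only the ones already mentioned in this thread.

```python
def pgi_openacc_args(rdc=False, optimize=True):
    """Build compile/link flag lists for an OpenACC f2py extension.

    Assumption: with rdc=True no -ta option is passed, as in the
    original build that used only -acc.
    """
    compile_args = ["-acc", "-Minfo"]
    if optimize:
        compile_args.insert(0, "-fast")
    if not rdc:
        # Skip relocatable device code generation (no device link step).
        compile_args.append("-ta=tesla:nordc")
    return {
        "extra_f90_compile_args": compile_args,
        # -acc is also needed at link time, otherwise the import fails.
        "extra_link_args": ["-acc"],
    }
```

The returned dictionary can be unpacked into the `Extension(...)` call, e.g. `Extension(name='solve', sources=[...], **pgi_openacc_args())`.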

(BTW, I have managed to rearrange my code so that I don’t need any function calls – the result isn’t pretty, but it works. I’d like to get to the bottom of this for future reference though)

nordc did fix the invalid handle error, as expected; the illegal address error is a new, unrelated problem. It is similar to a segmentation violation on the host: a bad address is being accessed on the device.

Are you still just using the simple “doit” routine or are there differences in what you’re using when called from Python?

If you’re using the “doit” version above, I’m assuming you’re using the version with the “declare” directive? This is problematic with nordc since global references require a device link step which is skipped when nordc is used. Try using the first version where “y” is copied as part of the compute region.

If you want the data to be persistent on the device, I would suggest adding a few more routines to control data movement. Something like:

module useit
    real, dimension(:), allocatable :: y
contains

    ! Allocate y on the host, fill it, and create the device copy.
    subroutine initdata(val)
        real :: val
        allocate(y(4))
        y = val
        !$acc enter data copyin(y)
    end subroutine initdata

    ! Remove the device copy first, while y is still allocated on the host.
    subroutine deletedata()
        !$acc exit data delete(y)
        deallocate(y)
    end subroutine deletedata

    ! Copy the device values back to the host.
    subroutine updateself()
        !$acc update self(y)
    end subroutine updateself

    ! Copy the host values to the device.
    subroutine updatedevice()
        !$acc update device(y)
    end subroutine updatedevice

    subroutine printdata()
        print *, y
    end subroutine printdata

    subroutine doit()
        integer :: i
        !$acc kernels loop
        do i = 1, 4
            y(i) = y(i) + 1
        enddo
    end subroutine doit

end module useit

program testit
    use useit
    call initdata(1.)
    call doit()
    call updateself()
    call printdata()
end program testit
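
From Python, the same sequence might look like the sketch below. It assumes f2py exposes the Fortran module as an attribute of the extension (for the setup.py above that would be `solve.useit`) and makes the module array available as `useit.y`. The function `run_persistent` and the stand-in class `FakeUseit` are hypothetical, and `FakeUseit` exists only so that the sketch runs without a compiled extension or a GPU.

```python
def run_persistent(useit, start=1.0):
    """Drive the data-management routines in a safe order and return y."""
    useit.initdata(start)   # allocate y on the host and copy it to the device
    useit.doit()            # the kernel updates the device copy only
    useit.updateself()      # bring the device values back to the host
    result = list(useit.y)  # read the host copy before it is released
    useit.deletedata()      # drop the device copy and deallocate y
    return result


class FakeUseit:
    """Pure-Python stand-in for the compiled module (illustration only)."""

    def __init__(self):
        self.y = None
        self._device = None

    def initdata(self, val):
        self.y = [val] * 4
        self._device = list(self.y)

    def doit(self):
        self._device = [v + 1 for v in self._device]

    def updateself(self):
        self.y = list(self._device)

    def deletedata(self):
        self._device = None
        self.y = None
```

With the real extension the call would be something like `import solve; run_persistent(solve.useit)`. The point of the ordering is that the host copy is read after `updateself()` and before `deletedata()`.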

-Mat