Fortran Shared Object called in python

Hi everyone,
I have a Fortran code that we successfully ported to GPU via OpenACC. The code runs correctly when compiled statically, but problems arise when it is compiled as a Shared Object (SO), i.e., with the flags -fpic -gpu=nordc . I need to compile the code as an SO so that it can be called directly via ctypes from a Python code, but I always get the same error:

FATAL ERROR: data in update device clause was not found on device 1

That is caused by an !$acc update device(…). Note that the same code works fine when I compile the Fortran code as an SO targeting the CPU. Any advice is highly appreciated.

Loriano

Hi Loriano,

What variable in the update clause is causing the error and how are you first allocating it on the device? Are you using an “enter data” directive, or a “declare” directive?

Without Relocatable Device Code (RDC) enabled, module variables using “declare” aren’t supported. The problem is that these require a device link step, which isn’t performed with “-gpu=nordc”.

In the past few years we added support for RDC in Fortran shared objects with OpenACC. So if you are using “declare” for module variables, try removing “-gpu=nordc” when creating the SO and see if that fixes the problem.

If that doesn’t help, please provide more details and if possible, a minimal reproducing example.

-Mat

Dear Mat,
Thanks indeed for your quick reply. Yes, I am using “declare” for the variables in question. I also quickly tested compiling without -gpu=nordc , but I am still getting the same error:

FATAL ERROR: data in update device clause was not found on device 1: name=fm(:,:)

That is caused by the cited “!$acc update device(fm, t)”, where both “fm” and “t” are declared in a different module as public real arrays and created on the device via a “declare” directive:

real, public :: fm(ninter,maxl)
!$acc declare create(fm)
real, public :: t(ninter)
!$acc declare create(t)
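
For context, the failing directive sits in an init-style routine, roughly like this sketch (module and routine names, and the array sizes, are illustrative):

```fortran
module gdata
  implicit none
  integer, parameter :: ninter = 64, maxl = 8  ! illustrative sizes
  real, public :: fm(ninter,maxl)
  !$acc declare create(fm)
  real, public :: t(ninter)
  !$acc declare create(t)
end module gdata

subroutine gfinit()
  use gdata
  ! ... fill fm and t on the host ...
  ! the next directive is the one reported in the error
  !$acc update device(fm, t)
end subroutine gfinit
```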

I will be out of the office for a couple of days; when I’m back I will prepare a code snippet reproducing the error.

Thanks again
Loriano

Sounds good. Jeff pinged me and should be able to get me the reproducer in a few days.

Ciao Mat,
thanks, I just pinged Jeff. Now that I am back I can prepare a test code myself; I just wrote to ask whether it is still needed.

thanks again

Jeff was in the middle of a project last week, so he wasn’t able to send me the code yet. It might be easier for him, but I can direct message you my email if you want to send the test to me.

Hi Mat,
I have been able to reproduce the issue using a quick toy code, which you can find in the following repo:

You can compile it easily via a simple make (though you may need to change some flags in config.mk):

$ make

nvfortran -acc=gpu -gpu=cc61,cuda12.1 -Minfo=accel -cuda -cudalib=cublas,cusolver -r8 -Minform=warn -Mextend -O3 -cudalib=cublas -fopenmp -fpic -DUSECUDANV -o funfm.o -c funfm.F

$ python3 ./pybertha.py

0.0 FATAL ERROR: data in update device clause was not found on device 1: name=fm(:,:) file:/home/redo/BERTHAGPU/pycudaf/gfinit.F gfinit line:18

While:

$ ./testb

0.000000000000000
1.0000000000000001E-005

Both testb and pybertha.py rely on the same bertha_wrapper.so Shared Object (SO). In the Python code the SO is loaded via ctypes.cdll.LoadLibrary within the pybertha class implemented in berthamod.py.

I tried to reproduce the basic structure (i.e., the code stack) of the original pybertha code via a quick cut and paste. Hopefully I did not make any obvious mistake; in case, just let me know.

Thanks again
Loriano

Hi Loriano,

This is an interesting one, though seems like a very specific case. The problem seems to be limited to using fixed size module arrays in a declare directive. My best guess is that there is some type initialization issue when the SO is loaded in python that doesn’t occur when loaded by ld. While I don’t know if it’s relevant, strace shows mprotect getting called after loading the SO from python. Possibly this added memory protection is causing the issue. This is purely a guess, so I’ve created an issue report, TPR # 34208, and will let engineering determine the root cause and if it’s something we can fix or not.

The good news is that I have two workarounds for you.

The first is to remove the “declare” directives and instead use an “enter data” directive in the init routine. This delays the device array creation until runtime, as opposed to library load.
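
A minimal sketch of this first workaround (names and sizes are illustrative):

```fortran
module gdata
  implicit none
  integer, parameter :: ninter = 64, maxl = 8  ! illustrative sizes
  real, public :: fm(ninter,maxl)  ! no "!$acc declare create" anymore
  real, public :: t(ninter)
end module gdata

subroutine gfinit()
  use gdata
  ! device copies are created here, at runtime, not at library load
  !$acc enter data create(fm, t)
  ! ... initialize fm and t on the host ...
  !$acc update device(fm, t)
end subroutine gfinit
```

Compute regions that reference fm and t then find them through the runtime present table rather than through the “declare”.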

The second is to change the arrays to be allocatable instead of fixed size and add an “allocate” in the init routine. The “declare” is still used. Again, the device array creation is delayed until the arrays are allocated.
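
A sketch of the second workaround, with the same illustrative names:

```fortran
module gdata
  implicit none
  real, public, allocatable :: fm(:,:), t(:)
  !$acc declare create(fm, t)
end module gdata

subroutine gfinit(ninter, maxl)
  use gdata
  integer, intent(in) :: ninter, maxl
  ! the device copies are created at this allocate, not at library load
  allocate(fm(ninter,maxl), t(ninter))
  ! ... initialize fm and t on the host ...
  !$acc update device(fm, t)
end subroutine gfinit
```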

Here’s the modified code with the work arounds:
pycudaf.tar (450 KB)

Case 1:

% make clean
% make EXTRA="-DUSE_CASE1"
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -r8 -Minform=warn -Mextend -O3 -cudalib=cublas   -fopenmp  -fpic  -DUSECUDANV -DUSE_CASE1   -o funfm.o -c funfm.F
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -r8 -Minform=warn -Mextend -O3 -cudalib=cublas   -fopenmp  -fpic  -DUSECUDANV -DUSE_CASE1   -o gfinit.o -c gfinit.F
gfinit:
     23, Generating enter data create(t(:),fm(:,:))
     25, Generating update device(t(:),fm(:,:))
nvcc -D_FILE_OFFSET_BITS=64 -O3   --compiler-options '-fopenmp' --compiler-options '-fPIC'   -o c_wrapper.o -c c_wrapper.c
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -r8 -Minform=warn -Mextend -O3 -cudalib=cublas   -fopenmp  -fpic  -DUSECUDANV -DUSE_CASE1   -o bertha_wrapper.o -c bertha_wrapper.F
nvfortran -shared -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -fopenmp funfm.o gfinit.o c_wrapper.o bertha_wrapper.o -o bertha_wrapper.so
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -r8 -Minform=warn -Mextend -O3 -cudalib=cublas   -fopenmp  -fpic  -DUSECUDANV -DUSE_CASE1   -o main.o -c main.F
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -fopenmp main.o -o testb bertha_wrapper.so
% python3 ./pybertha.py
0.0
1e-05

Case 2:

% make clean
rm -f *.o *.mod *__genmod.f90 bertha_wrapper.so testb
% make EXTRA="-DUSE_CASE2"
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -r8 -Minform=warn -Mextend -O3 -cudalib=cublas   -fopenmp  -fpic  -DUSECUDANV -DUSE_CASE2   -o funfm.o -c funfm.F
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -r8 -Minform=warn -Mextend -O3 -cudalib=cublas   -fopenmp  -fpic  -DUSECUDANV -DUSE_CASE2   -o gfinit.o -c gfinit.F
gfinit:
     25, Generating update device(t(:),fm(:,:))
nvcc -D_FILE_OFFSET_BITS=64 -O3   --compiler-options '-fopenmp' --compiler-options '-fPIC'   -o c_wrapper.o -c c_wrapper.c
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -r8 -Minform=warn -Mextend -O3 -cudalib=cublas   -fopenmp  -fpic  -DUSECUDANV -DUSE_CASE2   -o bertha_wrapper.o -c bertha_wrapper.F
nvfortran -shared -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -fopenmp funfm.o gfinit.o c_wrapper.o bertha_wrapper.o -o bertha_wrapper.so
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -r8 -Minform=warn -Mextend -O3 -cudalib=cublas   -fopenmp  -fpic  -DUSECUDANV -DUSE_CASE2   -o main.o -c main.F
nvfortran -acc=gpu -Minfo=accel -cuda -cudalib=cublas,cusolver   -fopenmp main.o -o testb bertha_wrapper.so
% python3 ./pybertha.py
0.0
1e-05

-Mat

Dear Mat,
thanks indeed. I am replying only now because I have been trying to test the workaround in the original code as well. In principle it should work, but I am getting some strange linking problems. I’ll keep trying and I’ll let you know.

Hi Mat,
Thanks again for your help. The linking problem is strangely related to the missing “declare create” in the “funfm” module; it is indeed strange, as via “nm” I see the symbol as defined. Still, without going into details, at the moment the situation in the original code can be summarized as follows:

1 - using method 1 (i.e., the “enter data” approach) I also need to add a “declare create” in the module to avoid linking issues. Still, while the code compiles and runs, the results are wrong. My guess is that the device update does not properly copy the data to the GPU.

2 - using method 2 (i.e., allocatable arrays, so the heap instead of the stack) I am able to compile the code, but I get a runtime error: “Accelerator Fatal Error: call to cuStreamSynchronize returned error 700: Illegal address during kernel execution”. Apparently the data is not allocated on the GPU.

3 - the only way seems to be an explicit copy (i.e., “copyin”) of the data instead of an “update device”, but clearly this forces me to move the data from the CPU to the GPU at each iteration.

I am still testing and evaluating the impact.

Thanks again
Loriano

Dear Mat,
at the moment I can confirm that, after removing the “update device” and copying the data via an explicit “copyin”, the code is working. Clearly I lose something in terms of performance (around 2%), as I need to copy the arrays at each iteration. I’ll keep doing some tests.
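
For reference, the explicit-copy variant could look roughly like this (a sketch; the loop structure and names are illustrative):

```fortran
do iter = 1, niter
  ! ... update fm and t on the host ...
  ! host-to-device copy repeated at every iteration
  !$acc data copyin(fm, t)
  ! ... compute kernels that read fm and t inside this data region ...
  !$acc end data
end do
```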

Loriano

Hi Loriano,

Engineering just let me know that TPR # 34208 was fixed in 23.11. I double checked the original test case and indeed it does run correctly.

-Mat