Hello!
I’m trying to use nvfortran subroutines. Here is my subroutine:
subroutine test3(n,y,sum1)
  real x(1000),y(1000),sum1
  integer n,i
!$acc parallel reduction(+:sum1)
  do i=1,n
    y(i) = 2.0*i + 1.0
    sum1 = sum1 + y(i)
  enddo
!$acc end parallel
! print *,y(n),sum1
end subroutine test3
Here are my compile/link statements
nvfortran -o mysub2acc.o mysub2acc.f90 -acc -Minfo=accel -c -fPIC
test3:
7, Generating Tesla code
8, !$acc loop vector(128) ! threadidx%x
Generating reduction(+:sum1)
7, Generating implicit copy(sum1) [if not already present]
Generating implicit copyout(y(:n)) [if not already present]
8, Loop is parallelizable
nvfortran -o mysub2acc.so mysub2acc.o -acc -Minfo=accel -shared
So far, so good. But when I run it, I get:
dyn.load("mysub2acc.so")
is.loaded("test3")
[1] TRUE
.Fortran("test3",n=as.single(800),y=as.single(rep(0,800)),sum1=as.single(0.0))
libgomp: TODO
This is on WSL 2 with Ubuntu 20.04 (Windows 10 laptop).
Any suggestions much appreciated.
Sincerely,
Erin
You’re always blazing trails, Erin!
The line “libgomp: TODO” is very odd. This is the GNU OpenMP runtime library (which also includes their OpenACC runtime). If this is indeed the problem, then my best guess is when the shared object gets dynamically loaded, somehow some of the symbols are getting resolved to libgomp as opposed to our runtime.
I’ve seen this happen once before, but only with calls to the OpenACC API, not with directives. And in that case it just caused bad results to be returned, not a “TODO” message. Our directive calls have internal names that shouldn’t conflict, so it may be a different problem.
I don’t really have any good ideas for you other than to try adding “-nomp” when linking the shared object. We link with our OpenMP runtime by default, but only so that process-to-core binding works. So while this shouldn’t conflict either, it’s worth a try.
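Concretely, the earlier link step with “-nomp” appended would look something like this (a sketch based on the commands posted above; adjust names and paths to your setup):

```shell
# Relink the shared object without pulling in our OpenMP runtime (-nomp).
# File names are taken from the compile/link commands shown earlier.
nvfortran -o mysub2acc.so mysub2acc.o -acc -Minfo=accel -shared -nomp
```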
-Mat
Hi Mat!
Here is the R response:
> dyn.load("mysub2acc")
Error in dyn.load("mysub2acc") :
unable to load shared object '/home/erinh/mysub2acc':
/home/erinh/mysub2acc: only ET_DYN and ET_EXEC can be loaded
>
Do we need a !DEC$ statement in there, please?
Thanks,
Erin
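One thing worth checking for the “only ET_DYN and ET_EXEC can be loaded” error: the file being loaded (“/home/erinh/mysub2acc”, with no extension) may not be the shared object at all. R’s dyn.load() wants the full file name, and the `file` utility can confirm what was actually built — a sketch, assuming the library from the earlier commands:

```shell
# A file loadable by R's dyn.load() should be an ELF "shared object" (ET_DYN),
# not a "relocatable" (the .o produced by the -c compile step).
file mysub2acc.so
file mysub2acc.o
```

If the .so looks right, loading it with the extension included — dyn.load("mysub2acc.so") — may behave differently than the extensionless name.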
Here is a solution:
!$acc loop reduction(+:sum1)
do i=1,n
  y(i) = 2.0*i + 1.0
  sum1 = sum1 + y(i)
enddo
!$acc end loop
The statement !$acc loop reduction(+:sum1) seems to solve everything. I was previously using the !$acc parallel directive. So we have “a” solution, not necessarily the best one.
Thanks,
Erin
Do you have a “parallel” region surrounding this? If not, then no OpenACC code will be generated since a loop directive by itself is non-functional.
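For reference, the combined directive keeps the compute region and the reduction together; this is a sketch of how the loop above would look with the parallel region restored (untested against the R interop problem):

```fortran
! Combined directive: opens a compute region and maps the loop onto it.
!$acc parallel loop reduction(+:sum1)
do i = 1, n
   y(i) = 2.0*i + 1.0
   sum1 = sum1 + y(i)
end do
!$acc end parallel loop
```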
No parallel region. R won’t take it when I tie the code in; it’s OK in a regular program.
This really didn’t solve the issue then, unless you don’t want the OpenACC code enabled.
Right. It seems like R doesn’t play well with the parallel region.
As soon as I put the parallel back in, I get the libgomp TODO again.
So here is a set from regular Windows 10 with the PGI compiler:
subroutine pi1subnv(pi,run_time)
!DEC$ ATTRIBUTES DLLEXPORT :: pi1subnv
  IMPLICIT NONE
  INTEGER :: i,id
  INTEGER, PARAMETER :: num_steps=100000000
  REAL*8 :: x,pi,sum1,step
  REAL :: start_time, run_time, end_time
  sum1 = 0.0
  step = 1.0 / num_steps
  call cpu_time(start_time)
!$acc parallel reduction(+:sum1)
  DO i = 1, num_steps
    x = (i-0.5)*step
    sum1 = sum1 + 4.0 /( 1.0 + x*x)
  ENDDO
!$acc end parallel
  pi = step * sum1
  call cpu_time(end_time)
  run_time = end_time - start_time
! WRITE(*,*) pi, run_time
end subroutine pi1subnv
Here is the compilation and linking from the Makefile:
C:\usereg\r-base-master>make
pgfortran -acc -Minfo=accel -Mlarge_arrays -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -lcudafor -lcudaforblas -ta=tesla:nordc -Lc:/"Program Files"/PGI/win64/19.10/lib/acc_init_link_cuda.obj -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccapi.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccg.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccn.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccg2.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libcudadevice.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/pgc.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libnspgc.lib -defaultlib:legacy_stdio_definitions -defaultlib:oldnames -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -lcudafor -lcudaforblas -Lc:/"Program Files"/PGI/win64/19.10/lib/libpgf90.lib -c pi1subnv.f90 -o pi1subnv.obj -m64 -Mcuda=cuda10.1
pi1subnv:
16, Generating Tesla code
16, Generating reduction(+:sum1)
17, !$acc loop vector(128) ! threadidx%x
16, Generating implicit copy(sum1) [if not already present]
17, Loop is parallelizable
pgfortran -Mmakedll -Bdynamic -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -acc -Minfo=accel -Mlarge_arrays -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -lcudafor -lcudaforblas -ta=tesla:nordc -Lc:/"Program Files"/PGI/win64/19.10/lib/acc_init_link_cuda.obj -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccapi.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccg.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccn.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccg2.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libcudadevice.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/pgc.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libnspgc.lib -defaultlib:legacy_stdio_definitions -defaultlib:oldnames -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -lcudafor -lcudaforblas -Lc:/"Program Files"/PGI/win64/19.10/lib/libpgf90.lib -acc -o pi1subnv.dll pi1subnv.obj -m64 -Mcuda=cuda10.1 -ta=tesla:nordc
Creating library pi1subnv.lib and object pi1subnv.exp
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_enter referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataenterstart2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataonb referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataenterdone referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_computestart2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_cuda_launchk2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_computedone referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataexitstart2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataoffb2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataexitdone referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_noversion referenced in function pi1subnv_
pi1subnv.dll : fatal error LNK1120: 11 unresolved externals
make: *** [Makefile:17: pi1subnv.obj] Error 2
It’s looking for something…
Thanks,
Erin
Hi Erin,
For the Windows DLL issue: there was a known driver issue where the OpenACC runtime libraries weren’t added by default when building a DLL, so you need to add them to the link line:
pgfortran -Mmakedll -acc -Minfo=accel -Mlarge_arrays -Bdynamic -ta=tesla:nordc,cuda10.1 pi1subnv.obj -o pi1subnv.dll -laccapi -laccg -laccn -laccg2 -lcudadevice
Note “-L” defines a path to a library directory and does not include the library (the -l flag does that), so all your “-L” options are extraneous.
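In other words, the intended pattern is one “-L” per search directory plus one “-l” per library, something like this sketch (directory path and library names mirror the suggested link line above):

```shell
# -L names the search directory once; each -l then picks a library from it.
pgfortran -Mmakedll -acc -ta=tesla:nordc,cuda10.1 pi1subnv.obj -o pi1subnv.dll \
    -L"c:/Program Files/PGI/win64/19.10/lib" \
    -laccapi -laccg -laccn -laccg2 -lcudadevice
```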
As for the WSL libgomp TODO issue, did you try linking with “-nomp”?
-Mat
No luck on the -nomp
I will give the Windows set another swat in a few minutes.
Thanks!
Here is something interesting.
The Windows works fine with R-4.0.0, but NOT with R-4.0.2.
Here is the error:
Error: internal error: invalid thread id
That’s definitely weird.
Thanks,
Erin
That means that the OpenACC runtime isn’t getting initialized. No idea why it worked with R-4.0.0. Are you sure it was using the GPU?
I’ve solved this in the past by adding a DllMain routine that calls the appropriate runtime initialization routines, depending on whether I’m using OpenACC on GPUs or multicore and whether CUDA is being used as well. The hiccup is that it needs to be compiled by a C++ compiler, which we don’t currently ship on Windows. Hence I use icpc, but I assume you can use Microsoft’s C++ compiler as well. I’m hoping that once we have a C++ compiler on Windows again (it’s in the works, but there’s no timeline for when it will be released), I’ll have a better solution.
This is an example of the code I use; comment or uncomment the init calls based on the OpenACC target and whether CUDA is also being used. I have not tested it with the newer NV releases.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

extern "C"
{
    void __setchk(long*,size_t,size_t);
    void _mp_preinit(void);
    void __pgi_acc_preinit(void);
    void __pgi_uacc_set_link_multicore(void);
    void __pgi_uacc_set_link_cuda(void);
    void __pgi_ctrl_init();
}

BOOL WINAPI DllMain(
    HINSTANCE hinstDLL,  // handle to DLL module
    DWORD fdwReason,     // reason for calling function
    LPVOID lpReserved )  // reserved
{
    // Perform actions based on the reason for calling.
    switch( fdwReason )
    {
        case DLL_PROCESS_ATTACH:
            long n;
            // printf("Calling setchk\n");
            // __setchk(&n+256+128*1024,0,0);
            // printf("Calling acc_preinit\n");
            __pgi_acc_preinit();
            // printf("Calling mp_preinit\n");
            // _mp_preinit();
            // __pgi_uacc_set_link_multicore();
            // __pgi_ctrl_init();
            // printf("Calling set link cuda\n");
            // __pgi_uacc_set_link_cuda();
            break;
        case DLL_THREAD_ATTACH:
            break;
        case DLL_THREAD_DETACH:
            break;
        case DLL_PROCESS_DETACH:
            break;
    }
    return TRUE;  // Successful DLL_PROCESS_ATTACH.
}
Hi Mat:
What options should I use to compile the DLLMain subroutine, please?
Also, once that compiles, how do I tie it in with the Fortran, please?
Thanks,
Erin
I can post my build script, but I don’t think it will help much. You’ll need to find the correct flags for the C++ compiler you’re using, and it will also depend on which linker you’re using.
Example build script using PGI 19.10 and Intel’s icl C++ compiler to compile the DLLmain, with xilink to create the DLL, and finally linking the DLL into an icl-built executable:
pgcc -c -Mdll -acc -ta=tesla:nordc,cc70 -Minfo=accel test_acc.c utils_acc.c
icl /debug -c myclass.cpp test_dll.cpp
xilink myclass.obj utils_acc.obj test_acc.obj test_dll.obj /out:myclass.dll -nologo -dll -incremental:no "-libpath:C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/VC/Tools/MSVC/14.16.27023/lib/x64" "-libpath:C:/Program Files (x86)/Windows Kits/10/Lib/10.0.17763.0/ucrt/x64" "-libpath:C:/Program Files (x86)/Windows Kits/10/Lib/10.0.17763.0/um/x64" -libpath:C:\PROGRA~1\PGI/win64/19.1/lib -defaultlib:libaccapi -defaultlib:libaccg -defaultlib:libaccn -defaultlib:libaccg2 -defaultlib:libcudadevice -defaultlib:ws2_32.lib -defaultlib:libpgmp -nodefaultlib:libvcruntime -nodefaultlib:libucrt -nodefaultlib:libcmt -defaultlib:msvcrt -defaultlib:pgc -defaultlib:libpgmath -defaultlib:pgmisc -defaultlib:libnspgc -defaultlib:legacy_stdio_definitions -defaultlib:oldnames /DYNAMICBASE:NO
icl /debug main.c /link /DYNAMICBASE:NO
Things are getting a little crazy here. I updated my Visual Studio, and then pgfortran couldn’t find the linker.
I reinstalled PGI 19.10 from the exe file from Downloads. Now it can’t find the license. Any suggestions most welcome.
Sorry for the trouble.
Thanks,
Erin
Back again.
I reinstalled the PGI and the license problem went away.
You asked if I was sure we were using the GPU. How would I tell, please?
I’m trying to work on the stuff with the DLLmain using cl.exe.
Thanks,
Erin
You can set the environment variable “PGI_ACC_NOTIFY=1” and the OpenACC runtime will emit a message to stderr each time a compute region is launched on the GPU. And/or you can watch the GPU usage via the “nvidia-smi” command.
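For example, from a shell on the Linux/WSL side — a sketch, where “myscript.R” is a hypothetical driver script and the variable must be set in the environment the R process inherits:

```shell
# Ask the OpenACC runtime to report each kernel launch on stderr.
export PGI_ACC_NOTIFY=1
R --no-save < myscript.R

# In another terminal, watch overall GPU utilization once per second:
nvidia-smi -l 1
```

If no launch messages appear and nvidia-smi shows no activity while the code runs, the compute regions are not reaching the GPU.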