Hello!
I’m trying to use nvfortran subroutines. Here is my subroutine:
subroutine test3(n,y,sum1)
  real x(1000),y(1000),sum1
  integer n,i
!$acc parallel reduction(+:sum1)
  do i=1,n
    y(i) = 2.0*i + 1.0
    sum1 = sum1 + y(i)
  enddo
!$acc end parallel
! print *,y(n),sum1
end subroutine test3
Here are my compile/link statements
nvfortran -o mysub2acc.o mysub2acc.f90 -acc -Minfo=accel -c -fPIC
test3:
7, Generating Tesla code
8, !$acc loop vector(128) ! threadidx%x
Generating reduction(+:sum1)
7, Generating implicit copy(sum1) [if not already present]
Generating implicit copyout(y(:n)) [if not already present]
8, Loop is parallelizable
nvfortran -o mysub2acc.so mysub2acc.o -acc -Minfo=accel -shared
So far, so good. But when I run it, I get:
dyn.load("mysub2acc.so")
is.loaded("test3")
[1] TRUE
.Fortran("test3",n=as.single(800),y=as.single(rep(0,800)),sum1=as.single(0.0))
libgomp: TODO
This is on WSL 2 with Ubuntu 20.04 (Windows 10 laptop).
Any suggestions much appreciated.
Sincerely,
Erin
You’re always blazing trails, Erin!
The line “libgomp: TODO” is very odd. This is the GNU OpenMP runtime library (which also includes their OpenACC runtime). If this is indeed the problem, then my best guess is when the shared object gets dynamically loaded, somehow some of the symbols are getting resolved to libgomp as opposed to our runtime.
I’ve seen this happen once before, but only with calls to the OpenACC API, not with directives. And in that case it just caused bad results to be returned, not a “TODO” message. Our directive calls have internal names that shouldn’t conflict, so it may be a different problem.
I don’t really have any good ideas for you other than to try adding “-nomp” when linking the shared object. We link with our OpenMP runtime by default, but only so that process-to-core binding works. So while this shouldn’t conflict either, it’s worth a try.
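Concretely, the earlier link step with “-nomp” appended would look something like this (a sketch based on the commands posted above; adjust names and paths to your setup):

```shell
# Relink the shared object without pulling in our OpenMP runtime (-nomp).
# File names are taken from the compile/link commands shown earlier.
nvfortran -o mysub2acc.so mysub2acc.o -acc -Minfo=accel -shared -nomp
```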
-Mat
Hi Mat!
Here is the R response:
> dyn.load("mysub2acc")
Error in dyn.load("mysub2acc") :
unable to load shared object '/home/erinh/mysub2acc':
/home/erinh/mysub2acc: only ET_DYN and ET_EXEC can be loaded
>
Do we need a !DEC$ statement in there, please?
Thanks,
Erin
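One thing worth checking for the “only ET_DYN and ET_EXEC can be loaded” error: the file being loaded (“/home/erinh/mysub2acc”, with no extension) may not be the shared object at all. R’s dyn.load() wants the full file name, and the `file` utility can confirm what was actually built — a sketch, assuming the library from the earlier commands:

```shell
# A file loadable by R's dyn.load() should be an ELF "shared object" (ET_DYN),
# not a "relocatable" (the .o produced by the -c compile step).
file mysub2acc.so
file mysub2acc.o
```

If the .so looks right, loading it with the extension included — dyn.load("mysub2acc.so") — may behave differently than the extensionless name.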
Here is a solution:
!$acc loop reduction(+:sum1)
do i=1,n
  y(i) = 2.0*i + 1.0
  sum1 = sum1 + y(i)
enddo
!$acc end loop
The statement !$acc loop reduction(+:sum1) seems to solve everything. I was previously using the !$acc parallel directive. So we have “a” solution, not necessarily the best one.
Thanks,
Erin
Do you have a “parallel” region surrounding this? If not, then no OpenACC code will be generated since a loop directive by itself is non-functional.
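For reference, the combined directive keeps the compute region and the reduction together; this is a sketch of how the loop above would look with the parallel region restored (untested against the R interop problem):

```fortran
! Combined directive: opens a compute region and maps the loop onto it.
!$acc parallel loop reduction(+:sum1)
do i = 1, n
   y(i) = 2.0*i + 1.0
   sum1 = sum1 + y(i)
end do
!$acc end parallel loop
```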
No parallel region. R won’t take it when I tie the code in; it’s OK in a regular program.
This really didn’t solve the issue then, unless you don’t want the OpenACC code enabled.
Right. It seems like R doesn’t play well with the parallel region.
As soon as I put the parallel back in, I get the libgomp TODO again.
So here is a set from regular Windows 10 with the PGI compiler:
subroutine pi1subnv(pi,run_time)
!DEC$ ATTRIBUTES DLLEXPORT :: pi1subnv
  IMPLICIT NONE
  INTEGER :: i,id
  INTEGER, PARAMETER :: num_steps=100000000
  REAL*8 :: x,pi,sum1,step
  REAL :: start_time, run_time, end_time
  sum1 = 0.0
  step = 1.0 / num_steps
  call cpu_time(start_time)
!$acc parallel reduction(+:sum1)
  DO i = 1, num_steps
    x = (i-0.5)*step
    sum1 = sum1 + 4.0 /( 1.0 + x*x)
  ENDDO
!$acc end parallel
  pi = step * sum1
  call cpu_time(end_time)
  run_time = end_time - start_time
! WRITE(*,*) pi, run_time
end subroutine pi1subnv
Here is the compilation and linking from the Makefile:
C:\usereg\r-base-master>make
pgfortran -acc -Minfo=accel -Mlarge_arrays -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -lcudafor -lcudaforblas -ta=tesla:nordc -Lc:/"Program Files"/PGI/win64/19.10/lib/acc_init_link_cuda.obj -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccapi.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccg.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccn.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccg2.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libcudadevice.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/pgc.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libnspgc.lib -defaultlib:legacy_stdio_definitions -defaultlib:oldnames -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -lcudafor -lcudaforblas -Lc:/"Program Files"/PGI/win64/19.10/lib/libpgf90.lib -c pi1subnv.f90 -o pi1subnv.obj -m64 -Mcuda=cuda10.1
pi1subnv:
16, Generating Tesla code
16, Generating reduction(+:sum1)
17, !$acc loop vector(128) ! threadidx%x
16, Generating implicit copy(sum1) [if not already present]
17, Loop is parallelizable
pgfortran -Mmakedll -Bdynamic -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -acc -Minfo=accel -Mlarge_arrays -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -lcudafor -lcudaforblas -ta=tesla:nordc -Lc:/"Program Files"/PGI/win64/19.10/lib/acc_init_link_cuda.obj -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccapi.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccg.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccn.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libaccg2.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libcudadevice.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/pgc.lib -Lc:/"Program Files"/PGI/win64/19.10/lib/libnspgc.lib -defaultlib:legacy_stdio_definitions -defaultlib:oldnames -Lc:/"Program Files"/PGI/win64/19.10/bin/pgf90.dll -Lc:/"Program Files"/PGI/win64/19.10/bin/pgc.dll -Bdynamic -lcudafor -lcudaforblas -Lc:/"Program Files"/PGI/win64/19.10/lib/libpgf90.lib -acc -o pi1subnv.dll pi1subnv.obj -m64 -Mcuda=cuda10.1 -ta=tesla:nordc
Creating library pi1subnv.lib and object pi1subnv.exp
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_enter referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataenterstart2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataonb referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataenterdone referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_computestart2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_cuda_launchk2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_computedone referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataexitstart2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataoffb2 referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_dataexitdone referenced in function pi1subnv_
pi1subnv.obj : error LNK2019: unresolved external symbol __pgi_uacc_noversion referenced in function pi1subnv_
pi1subnv.dll : fatal error LNK1120: 11 unresolved externals
make: *** [Makefile:17: pi1subnv.obj] Error 2
It’s looking for something…
Thanks,
Erin
Hi Erin,
For the Windows DLL issue: there was a known driver issue where the OpenACC runtime libraries weren’t added by default when building a DLL, so you need to add them to the link line:
pgfortran -Mmakedll -acc -Minfo=accel -Mlarge_arrays -Bdynamic -ta=tesla:nordc,cuda10.1 pi1subnv.obj -o pi1subnv.dll -laccapi -laccg -laccn -laccg2 -lcudadevice
Note “-L” defines a path to a library directory and does not include the library (the -l flag does that), so all your “-L” options are extraneous.
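In other words, the intended pattern is one “-L” per search directory plus one “-l” per library, something like this sketch (directory path and library names mirror the suggested link line above):

```shell
# -L names the search directory once; each -l then picks a library from it.
pgfortran -Mmakedll -acc -ta=tesla:nordc,cuda10.1 pi1subnv.obj -o pi1subnv.dll \
    -L"c:/Program Files/PGI/win64/19.10/lib" \
    -laccapi -laccg -laccn -laccg2 -lcudadevice
```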
As for the WSL libgomp TODO issue, did you try linking with “-nomp”?
-Mat
No luck on the -nomp
I will give the Windows set another swat in a few minutes.
Thanks!
Here is something interesting.
The Windows works fine with R-4.0.0, but NOT with R-4.0.2.
Here is the error:
Error: internal error: invalid thread id
That’s definitely weird.
Thanks,
Erin
That means that the OpenACC runtime isn’t getting initialized. No idea why it worked with R-4.0.0. Are you sure it was using the GPU?
I’ve solved this in the past by adding a DllMain routine that calls the appropriate runtime initialization routines, depending on whether I’m using OpenACC on GPUs or multicore and whether CUDA is being used as well. The hiccup is that it needs to be compiled by a C++ compiler, which we don’t currently ship on Windows. Hence I use icpc, but I assume you can use Microsoft’s C++ compiler as well. I’m hoping that once we have a C++ compiler on Windows again (it’s in the works, but there’s no timeline for when it will be released), I’ll have a better solution.
This is an example of the code I use; comment or uncomment the init calls based on the OpenACC target and whether CUDA is also being used. I have not tested it with the newer NV releases.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

extern "C"
{
    void __setchk(long*,size_t,size_t);
    void _mp_preinit(void);
    void __pgi_acc_preinit(void);
    void __pgi_uacc_set_link_multicore(void);
    void __pgi_uacc_set_link_cuda(void);
    void __pgi_ctrl_init();
}

BOOL WINAPI DllMain(
    HINSTANCE hinstDLL,  // handle to DLL module
    DWORD fdwReason,     // reason for calling function
    LPVOID lpReserved )  // reserved
{
    // Perform actions based on the reason for calling.
    switch( fdwReason )
    {
        case DLL_PROCESS_ATTACH:
            long n;
            // printf("Calling setchk\n");
            // __setchk(&n+256+128*1024,0,0);
            // printf("Calling acc_preinit\n");
            __pgi_acc_preinit();
            // printf("Calling mp_preinit\n");
            // _mp_preinit();
            // __pgi_uacc_set_link_multicore();
            // __pgi_ctrl_init();
            // printf("Calling set link cuda\n");
            // __pgi_uacc_set_link_cuda();
            break;
        case DLL_THREAD_ATTACH:
            break;
        case DLL_THREAD_DETACH:
            break;
        case DLL_PROCESS_DETACH:
            break;
    }
    return TRUE;  // Successful DLL_PROCESS_ATTACH.
}
Hi Mat:
What options should I use to compile the DLLMain subroutine, please?
Also, once that compiles, how do I tie it in with the Fortran, please?
Thanks,
Erin
I can post my build script, but I don’t think it will help much. You’ll need to find the correct flags for the C++ compiler you’re using, and it will also depend on which linker you’re using.
Example build script using PGI 19.10 and Intel’s icl C++ compiler to compile the DLLmain, with xilink to create the DLL, and finally linking the DLL into an icl-built executable:
pgcc -c -Mdll -acc -ta=tesla:nordc,cc70 -Minfo=accel test_acc.c utils_acc.c
icl /debug -c myclass.cpp test_dll.cpp
xilink myclass.obj utils_acc.obj test_acc.obj test_dll.obj /out:myclass.dll -nologo -dll -incremental:no "-libpath:C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/VC/Tools/MSVC/14.16.27023/lib/x64" "-libpath:C:/Program Files (x86)/Windows Kits/10/Lib/10.0.17763.0/ucrt/x64" "-libpath:C:/Program Files (x86)/Windows Kits/10/Lib/10.0.17763.0/um/x64" -libpath:C:\PROGRA~1\PGI/win64/19.1/lib -defaultlib:libaccapi -defaultlib:libaccg -defaultlib:libaccn -defaultlib:libaccg2 -defaultlib:libcudadevice -defaultlib:ws2_32.lib -defaultlib:libpgmp -nodefaultlib:libvcruntime -nodefaultlib:libucrt -nodefaultlib:libcmt -defaultlib:msvcrt -defaultlib:pgc -defaultlib:libpgmath -defaultlib:pgmisc -defaultlib:libnspgc -defaultlib:legacy_stdio_definitions -defaultlib:oldnames /DYNAMICBASE:NO
icl /debug main.c /link /DYNAMICBASE:NO
Things are getting a little crazy here. I updated my Visual Studio, and then pgfortran couldn’t find the linker.
I reinstalled PGI 19.10 from the exe file from Downloads. Now it can’t find the license. Any suggestions most welcome.
Sorry for the trouble.
Thanks,
Erin
Back again.
I reinstalled the PGI and the license problem went away.
You asked if I was sure we were using the GPU. How would I tell, please?
I’m trying to work on the stuff with the DLLmain using cl.exe.
Thanks,
Erin
You can set the environment variable “PGI_ACC_NOTIFY=1” and the OpenACC runtime will emit a message to stderr each time a compute region is launched on the GPU. And/or you can watch the GPU usage via the “nvidia-smi” command.
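For example, from a shell on the Linux/WSL side — a sketch, where “myscript.R” is a hypothetical driver script and the variable must be set in the environment the R process inherits:

```shell
# Ask the OpenACC runtime to report each kernel launch on stderr.
export PGI_ACC_NOTIFY=1
R --no-save < myscript.R

# In another terminal, watch overall GPU utilization once per second:
nvidia-smi -l 1
```

If no launch messages appear and nvidia-smi shows no activity while the code runs, the compute regions are not reaching the GPU.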