Openacc fortran acc routine error [nvlink error : undefined reference to 'subroutine_name_' in 'file_name']

Hello
I want to ask about using openACC at fortran.
I used nvhpc/22.11 and nvfortran compiler.

I meet complie error below message
(trdiag3p_gpu = subroutine_in)

nvlink error : Undefined reference to ‘trdiag3p_gpu_’ in ‘lica.o’
pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/tools/nvdd
make: *** [Makefile:27: lica] Error 2

I don’t know correct openACC useage.
As you know “subroutine_out” is parallelized by !$acc parallel, !$acc loop and call the “subroutine_in”.
AK, BK, CK, GK array are already exist gpu buffer.
I use !$acc enter data copyin AK, BK, CK, GK in subroutine_out.
As i known fortran always call by reference so when in the subroutine_in the “A” means “AK” array.

In “subroutine_in” i want to use array such as A, B, C, F (my opinion it is AK, BK, CK, GK) but when i use these array i meet nvlink error.

Could i ask this problem?
And also i already search about !$acc routine and find this Q&A.

[Since PGI doesn’t support nested parallelism,]
If i understand this Q&A correct, I already parallel compute at the outer function(subroutine_out). So i have to call subroutine_in sequential? is it right?

Through this forums , I have already helped you a lot.
Thanks. Mat Colgrove

(I have security issue, so i block most of codes.
If you need more information, please tell me.)


PROGRAM MAIN

    ......
    ......
    call subroutine_out()
    ......
    ......

    STOP
END


SUBROUTINE subroutine_out

    USE MOD_COMMON
    IMPLICIT NONE
    INTEGER*8     :: I,J,K
    REAL*8        :: CRE
    REAL*8, DIMENSION (:,:), ALLOCATABLE :: AK,BK,CK,GK

    !$acc routine(subroutine_in)

    allocate(AK(M1,M3),BK(M1,M3),CK(M1,M3),GK(M1,M3))

    !$acc enter data copyin ( AK(1:M1, 1:M3), BK(1:M1, 1:M3), CK(1:M1, 1:M3), GK(1:M1, 1:M3) )
    !$acc enter data copyin ( RHS1(0:M1, 0:M2, 0:M3, 3) )
    ! RHS1 define at MOD_COMMON
    

    !$acc parallel
    !$acc loop
    do J=1,200
        !$acc loop collapse(2) !independent
        do K=1,500
            do I=1,N1M
                AK(I,K)= ......
                BK(I,K)= ......
                CK(I,K)= ......
                GK(I,K)= ......
            enddo
        enddo
        !$acc end loop

        CALL subroutine_in(AK,BK,CK,GK,1,500,1,500)
        
        !$acc loop collapse(2) independent
        do K=1,500
            do I=1,500
                RHS1(I,J,K,1)= ......
            enddo
        enddo
        !$acc end loop
        
    enddo
    !$acc end loop
    !$acc end parallel

    !$acc exit data delete ( AK(1:M1, 1:M3), BK(1:M1, 1:M3), CK(1:M1, 1:M3), GK(1:M1, 1:M3) )
END


SUBROUTINE subroutine_in(A, B, C, F, J1,J2,L1,L2)
    !$acc routine seq 
    !seq or vector

    USE MOD_COMMON
    IMPLICIT NONE
    REAL*8    :: A(M1,M3),B(M1,M3),C(M1,M3),F(M1,M3)
    INTEGER*8 :: J1,J2,L1,L2
    INTEGER*8 :: I,J,K
    REAL*8    :: Q(M1,M3),S(M1,M3),FN(M1)
    REAL*8    :: BINV

    !$acc data present ( A(1:M1, 1:M3), B(1:M1, 1:M3), C(1:M1, 1:M3), F(1:M1, 1:M3) )
    !$acc data copyin ( Q(M1,M3), S(M1,M3), QE(M1,M3), FN(M1), PN(M1) )
    
    do K=L1,L2
        BINV= ......
        Q(K,J1)= ......
        S(K,J1)= ......
        FN(K)= ......
        F(K,J1)= ......
    enddo
    !$acc end data
    !$acc end data
END
    
    

Hi leejsera,

Just to ensure I understand. In the actual program, the subroutine is “trdiag3p_gpu_” but you’ve included a representative example which show the structure of the program. In the example, “subroutine_in” represents “trdiag3p_gpu”.

Typically this error means that no device code could be found for the subroutine. Often due to the user forgetting to use the “acc routine” directive which tells the compiler to create the device version.

Here you do have a “acc routine” directive, but also have two data regions. Data regions can only be used from host code and may be inhibiting the compiler from generating a device version.

You should be able to tell if this is the case by adding the flag “-Minfo=accel” where the compiler will indicate if it was able to successfully create the device routine.

In any event, you should remove the data directives from the device routine.

If that doesn’t solve it, let me know, and we can trouble shoot the problem. In some cases, I’ve seen the order in which the objects are presented on the link line can cause this issue as well.

-Mat

1 Like

Yes! you are right trdiag3p_gpu is subroutine_in.
And i really appreciate to your answer.

Here you do have a “acc routine” directive, but also have two data regions. Data regions can only be used from host code and may be inhibiting the compiler from generating a device version.

Now i understood Data regions can only be used from host code.
I have some questions.

  1. If i used “!$acc routine seq” at subroutine_in, the compiler will be compile the subroutine to device version.
    Is the code where inside of subroutine_in will be run at gpu?
    Does it means the subroutine_in is a device code?
    As i known the inside of !$acc parallel is device code. other regions are host code region.

  2. I already parallelize the code of “subroutine_out” using !$acc parallel, !$acc loop.
    So i can’t parallelize the “subroutine_in”, only can use !$acc routine seq or !$acc routine vector (I understand this is small parallelize)
    is it right?

  3. If the answer of Q.2 is right, I can’t use “!$acc parallel loop” in subroutine_in, but I wonder if it can be used by replacing the codes of the subroutine without calling Subroutine_in.

  4. I remove the !$acc data regions in “subroutine_in” but i can’t compiled.
    I meet same nvlink error. (subroutine_in is same trdiag3p_gpu)

trdiag3p_gpu:
1763, Accelerator restriction: Indirect function/procedure calls are not supported
0 inform, 1 warnings, 0 severes, 0 fatal for trdiag3p_gpu
test:
3255, Generating acc routine seq
Generating NVIDIA GPU code
0 inform, 1 warnings, 0 severes, 0 fatal for trdiag3p
nvlink error : Undefined reference to ‘trdiag3p_gpu_’ in ‘lica.o’
pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/tools/nvdd
make: *** [Makefile:27: lica] Error 2

Is the code where inside of subroutine_in will be run at gpu?

Adding a “routine” directive on a subroutine will have the compiler create a device routine that can be called from other routines or from within an OpenACC compute region.

Does it means the subroutine_in is a device code?

Unless you add the “nohost” clause, the compiler will create two versions of the subroutine, one for the host and one for the device. Hence, it will depend from where it will be run.

Though as for the rules on where OpenACC directives can be placed, i.e. no data directives or compute regions, then yes it’s considered device code.

So i can’t parallelize the “subroutine_in”, only can use !$acc routine seq or !$acc routine vector (I understand this is small parallelize)
is it right?

“routine seq” will create a sequential subroutine, i.e. no parallelism.

However, “routine vector”, allows for parallelism across vectors. You’d define which loop to parallelize via the “loop” directive.

If the answer of Q.2 is right, I can’t use “!$acc parallel loop” in subroutine_in, but I wonder if it can be used by replacing the codes of the subroutine without calling Subroutine_in.`

Keep in mind that “parallel” and “loop” are two different directives. They are often combined but mean different things.

“parallel” defines the compute region to offload. While the OpenACC standard allows for nested parallelism (i.e. having compute regions within other compute regions), we don’t support this. Basically this will have one CUDA kernel be launched from with another CUDA kernel which is not useful except for a very small number of use cases.

“loop” defines where to distribute the work across the different levels of parallelism, i.e. gang, worker, or vector. “loop” may be used within a compute region or device subroutine if using “routine vector”, “routine worker”, or “routine gang”.

I remove the !$acc data regions in “subroutine_in” but i can’t compiled.
I meet same nvlink error. (subroutine_in is same trdiag3p_gpu)

Ok, that was just a guess, but it doesn’t look like the device subroutine can be generated due to the following:

1763, Accelerator restriction: Indirect function/procedure calls are not supported

Looks like you might have a procedure pointer in there. While were working on adding this support, it’s not available as of yet. You’ll need to have the device code make direct calls.

-Mat

1 Like

Thanks to your reply.
I found the problem.
the matrix size M1, M2, M3 was variable.
I think it was parameter but it is just variable. It’s my mistake.
The compiler couldn’t know the size of 2d array at compile time.

And i solved this problem to change the parallization structure.

Once again, thanks to your reply!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.