The Fortran OpenACC acceleration code compiles successfully but still runs on the CPU

Hello everyone, I have encountered an issue. The same Fortran code runs on the GPU when I run it standalone, but when I place it into a larger project, it automatically runs on the CPU instead of the GPU. The code is exactly the same. Why does it default to CPU execution when it is part of a larger project?

This is the code.

!$acc data copyin(BT_11, BT_22, BT_33) copyout(result1, result2)
!$acc parallel loop gang vector collapse(2) private(sub1, sub2, sub3, istart2, jstart2, max_cc1, max_cc2) firstprivate(sbox_width,bbox_width)
DO iline = 50, yn_amv-50!229
    DO ielem = 50, xn_amv-50!229
        max_cc1 = -999
        max_cc2 = -999
        istart2 = (ielem-1) * sbox_width/2 + 1
        jstart2 = (iline-1) * sbox_width/2 + 1
        sub2 = BT_22(istart2:istart2+sbox_width-1, jstart2:jstart2+sbox_width-1)
        do x = 1, bbox_width - sbox_width + 1
            do y = 1, bbox_width - sbox_width + 1
                istart1 = max(1, istart2 - (bbox_width-sbox_width)/2 + x - 1)
                jstart1 = max(1, jstart2 - (bbox_width-sbox_width)/2 + y - 1)
                if (istart1 + sbox_width - 1 <= 2748 .and. jstart1 + sbox_width - 1 <= 2748) then
                    sub1 = BT_11(istart1:istart1+sbox_width-1, jstart1:jstart1+sbox_width-1)
                    sub3 = BT_33(istart1:istart1+sbox_width-1, jstart1:jstart1+sbox_width-1)
                    correlation1 = get_matrix_correlation_coef_f(sbox_width, sbox_width, sub2, sub1)
                    correlation2 = get_matrix_correlation_coef_f(sbox_width, sbox_width, sub2, sub3)
                    if (max_cc1 < correlation1) then
                        max_cc1 = correlation1
                    end if
                    if (max_cc2 < correlation2) then
                        max_cc2 = correlation2
                    end if
                end if
            end do
        end do
        result1(ielem, iline) = max_cc1
        result2(ielem, iline) = max_cc2
        print*,'xx',ielem, iline,max_cc1,max_cc2
    END DO
END DO
!$acc end parallel
!$acc end data

this is compile result:

   412, Generating copyin(bt_11(:,:),bt_22(:,:),bt_33(:,:)) [if not already present]
         Generating copyout(result2(:,:),result1(:,:)) [if not already present]
    413, Generating implicit firstprivate(yn_amv,xn_amv)
         Generating NVIDIA GPU code
        414, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        415,   ! blockidx%x threadidx%x collapsed
        421, !$acc loop seq
        422, !$acc loop seq
             Generating implicit reduction(max:max_cc2,max_cc1)
        423, !$acc loop seq
             Generating implicit reduction(max:max_cc2,max_cc1)
        429, !$acc loop seq
    413, Local memory used for sub1,sub3,sub2
    415, Generating implicit firstprivate(x)
    421, Loop is parallelizable
    422, Loop carried reuse of sub1 prevents parallelization
         Complex loop carried dependence of sub1 prevents parallelization
         Loop carried reuse of sub3 prevents parallelization
         Complex loop carried dependence of sub3,sub1 prevents parallelization
         Generating implicit firstprivate(y)
    423, Loop carried reuse of sub1,sub3 prevents parallelization
         Generating implicit firstprivate(correlation2,jstart1,istart1,correlation1)
         Loop carried reuse of sub3 prevents parallelization
    429, Loop is parallelizable
get_matrix_correlation_coef_f:
    537, Generating implicit acc routine seq
         Generating acc routine seq
         Generating NVIDIA GPU code

How are you linking the larger application? Are you using a different compiler than nvfortran or nvc/nvc++?

RDC (relocatable device code) is used by default. It allows for things such as using global data directly from device subroutines as well as calling device routines located in separate files. This requires a device link step, so if you’re missing the GPU flags (i.e. “-acc” or “-gpu=…”), or using a different compiler to link, then this step is missed and the GPU code wont be enabled.

If you are using a different compiler, then ideally you’d switch to use our compilers for the full project. If that’s not practical, then compile the OpenACC code with “-gpu=nordc” to disable RDC. This will then embed the device PTX in the binary which gets JIT compiled at runtime.

There could be other reasons as well, but this is most likely what’s happening. Though if you are using NVHPC to compile the whole application and linking with the correct flags, then let me know and we can look at these.

-Mat

h5fc -acc -gpu=cc89 -Minfo=accel -g -Mvect=levels:6 -c -O3 GEO.f90

This is my compile way.I use h5fc as compiler and it’s based on nvfortran.

(runenv) [user_01@localhost GEO]$ h5fc -v
Export NVCOMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.7
Export PGI=/opt/nvidia/hpc_sdk

and the big program compile result as follow:

          calculate_vd_acc:
              984, Generating copyin(bt1(:,:),bt2(:,:),lat(:,:)) [if not already present]
                   Generating copyout(dir2(:,:)) [if not already present]
                   Generating copyin(lon(:,:)) [if not already present]
                   Generating copyout(cc2(:,:,:),sp2(:,:)) [if not already present]
                   Generating copyin(tag2,tag1,overlap_flag,bt3(:,:)) [if not already present]
                   Generating copyout(cc1(:,:,:)) [if not already present]
                   Generating copyin(qf(:,:),delta_t) [if not already present]
                   Generating copyout(sp1(:,:),dir1(:,:)) [if not already present]
              985, Generating implicit firstprivate(line_amv)
                   Generating NVIDIA GPU code
                  986, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
                  987,   ! blockidx%x threadidx%x collapsed
                 1007, !$acc loop seq
                 1011, !$acc loop seq
                 1012, !$acc loop seq
                 1019, !$acc loop seq
              987, Generating implicit firstprivate(nan_value,x,lon3,lat3,lat1,lon1,lat2,lon2)
             1007, Loop is parallelizable
             1011, Loop carried scalar dependence for max_cc1 at line 1028
                   Scalar last value needed after loop for max_cc1 at line 1057,1086
                   Loop carried scalar dependence for max_cc2 at line 1034
                   Scalar last value needed after loop for max_cc2 at line 1058,1086
                   Loop carried dependence of sub1 prevents parallelization
                   Loop carried backward dependence of sub1 prevents vectorization
                   Complex loop carried dependence of sub1 prevents parallelization
                   Loop carried dependence of sub3,sub1 prevents parallelization
                   Loop carried backward dependence of sub3 prevents vectorization
                   Complex loop carried dependence of sub3 prevents parallelization
                   Generating implicit firstprivate(y)
             1012, Loop carried scalar dependence for max_cc1 at line 1028
                   Scalar last value needed after loop for max_cc1 at line 1057,1086
                   Loop carried scalar dependence for max_cc2 at line 1034
                   Scalar last value needed after loop for max_cc2 at line 1058,1086
                   Loop carried dependence of sub1 prevents parallelization
                   Loop carried backward dependence of sub1 prevents vectorization
                   Loop carried dependence of sub3,sub1 prevents parallelization
                   Loop carried backward dependence of sub3 prevents vectorization
                   Generating implicit firstprivate(correlation2,jstart1,istart1,correlation1)
             1019, Loop is parallelizable
          calculate_speed_direction:
             2378, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
          get_matrix_correlation_coef:
             2557, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
          corref_2x2:
             3156, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
          mean_2x2:
             3176, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
          std_2x2:
             3210, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
      nvfortran xx.o yy.o geo.o   -I/home/user_01/software/hdf5/include -L/home/user_01/software/hdf5/lib /home/user_01/software/hdf5/lib/libhdf5hl_fortran.a /home/user_01/software/hdf5/lib/libhdf5_hl.a /home/user_01/software/hdf5/lib/libhdf5_fortran.a /home/user_01/software/hdf5/lib/libhdf5.a -L/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -lsz -lz -ldl -lm -Wl,-rpath -Wl,/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -lmfhdf -ldf -lsz -ljpeg -lz -ltirpc -L/opt/intel/lib/intel64 -lintlc -L/opt/nvidia/hpc_sdk/Linux_x86_64/24.7/compilers/lib -lacchost -o ../GEO.EXE

Thank you very much!

Here’s your link line. I’m not seeing “-acc” or “-gpu=cc89”, which should be added. Also, you’ve explicitly included the OpenACC host runtime (-lacchost) which should be removed.

Thinks for you answer! I modify the link line as follow:

  nvfortran -acc -gpu=cc89 -Minfo=accel -g -Mvect=levels:6 -c -O3 xx.o yy.o geo.o  -I/home/user_01/software/hdf5/include -L/home/user_01/software/hdf5/lib /home/user_01/software/hdf5/lib/libhdf5hl_fortran.a /home/user_01/software/hdf5/lib/libhdf5_hl.a /home/user_01/software/hdf5/lib/libhdf5_fortran.a /home/user_01/software/hdf5/lib/libhdf5.a -L/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -lsz -lz -ldl -lm -Wl,-rpath -Wl,/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -L/home/user_01/software/hdf5/lib -lmfhdf -ldf -lsz -ljpeg -lz -ltirpc -L/opt/intel/lib/intel64 -lintlc -L/opt/nvidia/hpc_sdk/Linux_x86_64/24.7/compilers/lib -o ../GEO.EXE

but it still run on cpu.

and I meet another question,when I compile this code:

!$acc data copyin(data1, data2, data3,lon,lat) copyout(CC1(:,:,:),CC2(:,:,:))
!$acc parallel loop gang vector collapse(2) private(sub1(:,:), sub2(:,:), sub3(:,:), istart2, jstart2, max_cc1, max_cc2, tx1, ty1, tx2, ty2,correlation2,jstart1,istart1,correlation1,lon1,lat1,lon2,lat2,lon3,lat3,txx1,tyy1,txx2,tyy2) firstprivate(delta_t,tag1,tag2,sbox_size,bbox_size)
do i = 1, line_amv
    do j = 1, line_amv
        max_cc1 = -999
        max_cc2 = -999
        istart2 = (i-1) * sbox_size/2 + 1
        jstart2 = (j-1) * sbox_size/2 + 1

        sub2 = data2(istart2:istart2+sbox_size-1, jstart2:jstart2+sbox_size-1)
        .....

it report:

NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): No device symbol for address reference 

Could it be because I set sub1 etc. as private variables? And its index inside is related to the global variable data1, etc.

Looks like you also accidently added the “-c” flag which tells the compiler to stop after compiling. In other words, it didn’t get linked and you were using an older binary.

This is a compiler issue where it’s losing track of some variable. It’s generic error so I’d need a reproducing example to know what’s actually causing it.

For the code you show, one possibility is with the implicit temp array used to hold the contents of “data2” before copying to sub2. Fortran specifies that the right hand side of an array syntax statement needs to be fully evaluated before assignment thus a temp array is needed. The compiler can often optimize this away, but since you have “-g”, this optimization is inhibited.

One option is to replace the array syntax with an explicit loop so no temp array is needed. Granted, I’m just guessing here, but worth a try.

Here is my code. I tried to explicitly assign sub1, sub2, and sub3, but I’m still getting the same error: “No device symbol for address reference” (calculate_correlation_allocate.f90: 116).

REAL, ALLOCATABLE :: data1(:,:), data2(:,:), data3(:,:)
REAL, ALLOCATABLE :: lon(:,:), lat(:,:)
REAL, ALLOCATABLE :: sub1(:,:), sub2(:,:), sub3(:,:)
REAL, ALLOCATABLE :: result1(:,:), result2(:,:)
REAL, ALLOCATABLE :: CC1(:,:,:), CC2(:,:,:)

ALLOCATE(data1(line_l1b, line_l1b), data2(line_l1b, line_l1b), data3(line_l1b, line_l1b))
ALLOCATE(lon(line_l1b, line_l1b), lat(line_l1b, line_l1b))
ALLOCATE(sub1(sbox_width, sbox_width), sub2(sbox_width, sbox_width), sub3(sbox_width, sbox_width))
ALLOCATE(CC1(6, line_amv, line_amv), CC2(6, line_amv, line_amv))

!$acc data copyin(data1, data2, data3,lon,lat) copyout(CC1(:,:,:),CC2(:,:,:))
!$acc parallel loop gang vector collapse(2) private(sub1(:,:), sub2(:,:), sub3(:,:), istart2, jstart2, max_cc1, max_cc2, tx1, ty1, tx2, ty2,correlation2,jstart1,istart1,correlation1,lon1,lat1,lon2,lat2,lon3,lat3,txx1,tyy1,txx2,tyy2) firstprivate(delta_t,tag1,tag2,sbox_size,bbox_size)
    do i = 1, line_amv
        do j = 1, line_amv
            max_cc1 = -999
            max_cc2 = -999
            istart2 = (i-1) * sbox_size/2 + 1
            jstart2 = (j-1) * sbox_size/2 + 1
!            sub2 = data2(istart2:istart2+sbox_size-1, jstart2:jstart2+sbox_size-1)
            DO ii = 1, sbox_size
                DO jj = 1, sbox_size
                    sub2(ii, jj) = data2(istart2 + ii - 1, jstart2 + jj - 1)
                END DO
            END DO
            do x = 1, bbox_size - sbox_size + 1
                do y = 1, bbox_size - sbox_size + 1
                    istart1 = max(1, istart2 - (bbox_size-sbox_size)/2 + x - 1)
                    jstart1 = max(1, jstart2 - (bbox_size-sbox_size)/2 + y - 1)
                    if (istart1 + sbox_size - 1 <= line_l1b .and. jstart1 + sbox_size - 1 <= line_l1b) then
!                        sub1 = data1(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
!                        sub3 = data3(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
                        DO xx = 1, sbox_size
                            DO yy = 1, sbox_size
                                sub1(xx, yy) = data1(istart1 + xx - 1, jstart1 + yy - 1)
                            END DO
                        END DO
                        DO xx = 1, sbox_size
                            DO yy = 1, sbox_size
                                sub3(xx, yy) = data3(istart1 + xx - 1, jstart1 + yy - 1)
                            END DO
                        END DO
                        correlation1 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub1)
                        correlation2 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub3)
                        if (max_cc1 < correlation1) then
                            max_cc1 = correlation1
                            txx1 = x
                            tyy1 = y
                        end if
                        if (max_cc2 < correlation2) then
                            max_cc2 = correlation2
                            txx2 = x
                            tyy2 = y
                        end if
                    end if
                end do
            end do

Can you please edit the above example or attach the source file to make it a complete reproducer including dependent modules and the definition for “get_matrix_correlation_coef”.

I updated this example to get it to compile, but it doesn’t reproduce the error.

Does “get_matrix_correlation_coef” contain an automatic array? We did have issues with automatics and VLAs in the early 24.x releases where it would give this error, which has since been fixed. I don’t know if this is the same issue, but you might try updating to our latest release.

Thank you for your answer. I simplified the code and shortened it as follows. Its compilation error is the same as before. If I remove sub1, sub2, and sub3 in private, it can be compiled normally, but the calculation result is wrong.

    program CC
    use hdf5
    implicit none
    integer(kind=4)     :: sbox_width, bbox_width, line_l1b, line_amv
    ! Variables
    REAL, ALLOCATABLE :: data1(:,:), data2(:,:), data3(:,:)
    REAL, ALLOCATABLE :: lon(:,:), lat(:,:)
    REAL, ALLOCATABLE :: sub1(:,:), sub2(:,:), sub3(:,:)
    REAL, ALLOCATABLE :: result1(:,:), result2(:,:)
    REAL, ALLOCATABLE :: CC1(:,:,:), CC2(:,:,:)
    integer :: sbox_size, bbox_size, tx1, ty1, tx2, ty2,txx1,tyy1,txx2,tyy2,tx,ty
    integer :: i, j,ii,jj, istart1, jstart1, istart2, jstart2
    integer :: x, y, xx, yy
    real :: correlation1, correlation2, max_cc1, max_cc2
    !  assign value
    sbox_size = sbox_width
    bbox_size = bbox_width
    ALLOCATE(data1(line_l1b, line_l1b), data2(line_l1b, line_l1b), data3(line_l1b, line_l1b))
    ALLOCATE(lon(line_l1b, line_l1b), lat(line_l1b, line_l1b))
    ALLOCATE(sub1(sbox_width, sbox_width), sub2(sbox_width, sbox_width), sub3(sbox_width, sbox_width))
    ALLOCATE(CC1(6, line_amv, line_amv), CC2(6, line_amv, line_amv))
    !$acc data copyin(data1, data2, data3,lon,lat) copyout(CC1(:,:,:),CC2(:,:,:))
    !$acc parallel loop gang vector collapse(2) private(sub1(:,:), sub2(:,:), sub3(:,:), istart2, jstart2, max_cc1, max_cc2, tx1, ty1, tx2, ty2,correlation2,jstart1,istart1,correlation1,txx1,tyy1,txx2,tyy2) firstprivate(sbox_size,bbox_size)
    do i = 1, line_amv
        do j = 1, line_amv
            max_cc1 = -999
            max_cc2 = -999
            istart2 = (i-1) * sbox_size/2 + 1
            jstart2 = (j-1) * sbox_size/2 + 1
            !sub2 = data2(istart2:istart2+sbox_size-1, jstart2:jstart2+sbox_size-1)
            DO ii = 1, sbox_size
                DO jj = 1, sbox_size
                    sub2(ii, jj) = data2(istart2 + ii - 1, jstart2 + jj - 1)
                END DO
            END DO
            do x = 1, bbox_size - sbox_size + 1
                do y = 1, bbox_size - sbox_size + 1
                    istart1 = max(1, istart2 - (bbox_size-sbox_size)/2 + x - 1)
                    jstart1 = max(1, jstart2 - (bbox_size-sbox_size)/2 + y - 1)
                    if (istart1 + sbox_size - 1 <= line_l1b .and. jstart1 + sbox_size - 1 <= line_l1b) then
                        !sub1 = data1(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
                        !sub3 = data3(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
                        DO xx = 1, sbox_size
                            DO yy = 1, sbox_size
                                sub1(xx, yy) = data1(istart1 + xx - 1, jstart1 + yy - 1)
                            END DO
                        END DO
                        DO xx = 1, sbox_size
                            DO yy = 1, sbox_size
                                sub3(xx, yy) = data3(istart1 + xx - 1, jstart1 + yy - 1)
                            END DO
                        END DO
                        correlation1 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub1)
                        correlation2 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub3)
                        if (max_cc1 < correlation1) then
                            max_cc1 = correlation1
                            txx1 = x
                            tyy1 = y
                        end if
                        if (max_cc2 < correlation2) then
                            max_cc2 = correlation2
                            txx2 = x
                            tyy2 = y
                        end if

                    end if
                end do
            end do

        end do
    end do
    !$acc end parallel
    !$acc end data
    DEALLOCATE(data1, data2, data3, lon, lat)
    DEALLOCATE(sub1, sub2, sub3)
    DEALLOCATE(CC1, CC2)
contains
    !1$acc routine seq
    FUNCTION get_matrix_correlation_coef(nx, ny, t, s) RESULT(r)
        INTEGER, value,INTENT(IN)                :: nx, ny
        REAL, DIMENSION(:,:), INTENT(IN)   :: t, s
        REAL                               :: r
        REAL                               :: E_t, E_s, E_ts, sita_t, sita_s

        E_t = SUM(t) / (nx * ny)
        E_s = SUM(s) / (nx * ny)
        E_ts = SUM((t - E_t) * (s - E_s)) / (nx * ny)
        sita_t = SQRT(SUM((t - E_t)**2) / (nx * ny))
        sita_s = SQRT(SUM((s - E_s)**2) / (nx * ny))
!        r = E_ts / (sita_t * sita_s)
        IF (sita_t == 0.0 .OR. sita_s == 0.0) THEN
            r = 0.0
        ELSE
            r = E_ts / (sita_t * sita_s)
        END IF
        if(r .gt. 1) then
            r = 0
        end if
!        print*,'rr',r
    END FUNCTION get_matrix_correlation_coef
end program CC

Thanks! I was able to reproduce the issue.

The problem has to do with passing private allocable arrays as assumed shape which requires each private array to have it’s own local descriptor. This is problematic to manage and can create a huge performance penalty. The same issue occurs with CUDA Fortran, standard language parallelism, and OpenMP offload.

Hence we recommend users not pass assumed-shape arrays. Instead use assumed-size or fixed size arrays which do not require a descriptor.

Here you’re using the SUM and SQRT intrinsics, so wont be able to use assumed-size. If you can make them fixed size, then it should work. Though if that’s too limiting, the third option would be to manually inline the routine. I manually inlined the routine and it worked around the error.

Note that contained device subroutines can also be problematic so I’d recommend moving the definition for these routines into a module or a separate subroutine with an interface instead. The problem being that contained subroutines have a hidden argument to the parent’s stack so the child can access the parent’s variables. Since the stack is on the host, this will give illegal address errors that are hard to debug.

In this particular case, you’re only using the passed in variables, not the parent’s variables, so are ok, but it’s good practice to avoid contained device subroutines.

    module foo

    contains
    !$acc routine seq
    FUNCTION get_matrix_correlation_coef(nx, ny, t, s) RESULT(r)
        INTEGER, value,INTENT(IN)                :: nx, ny
        REAL, DIMENSION(:,:), INTENT(IN)   :: t, s
        REAL                               :: r
        REAL                               :: E_t, E_s, E_ts, sita_t, sita_s

        E_t = SUM(t) / (nx * ny)
        E_s = SUM(s) / (nx * ny)
        E_ts = SUM((t - E_t) * (s - E_s)) / (nx * ny)
        sita_t = SQRT(SUM((t - E_t)**2) / (nx * ny))
        sita_s = SQRT(SUM((s - E_s)**2) / (nx * ny))
!        r = E_ts / (sita_t * sita_s)
        IF (sita_t == 0.0 .OR. sita_s == 0.0) THEN
            r = 0.0
        ELSE
            r = E_ts / (sita_t * sita_s)
        END IF
        if(r .gt. 1) then
            r = 0
        end if
!        print*,'rr',r
    END FUNCTION get_matrix_correlation_coef


    end module foo


    program CC
!    use hdf5
    use foo
    implicit none
    integer(kind=4)     :: sbox_width, bbox_width, line_l1b, line_amv
    ! Variables
    REAL, ALLOCATABLE :: data1(:,:), data2(:,:), data3(:,:)
    REAL, ALLOCATABLE :: lon(:,:), lat(:,:)
    REAL, ALLOCATABLE :: sub1(:,:), sub2(:,:), sub3(:,:)
    REAL, ALLOCATABLE :: result1(:,:), result2(:,:)
    REAL, ALLOCATABLE :: CC1(:,:,:), CC2(:,:,:)
    integer :: sbox_size, bbox_size, tx1, ty1, tx2, ty2,txx1,tyy1,txx2,tyy2,tx,ty
    integer :: i, j,ii,jj, istart1, jstart1, istart2, jstart2
    integer :: x, y, xx, yy
    real :: correlation1, correlation2, max_cc1, max_cc2
    REAL :: E_t, E_s, E_ts, sita_t, sita_s
    !  assign value
    sbox_size = sbox_width
    bbox_size = bbox_width
    ALLOCATE(data1(line_l1b, line_l1b), data2(line_l1b, line_l1b), data3(line_l1b, line_l1b))
    ALLOCATE(lon(line_l1b, line_l1b), lat(line_l1b, line_l1b))
    ALLOCATE(sub1(sbox_width, sbox_width), sub2(sbox_width, sbox_width), sub3(sbox_width, sbox_width))
    ALLOCATE(CC1(6, line_amv, line_amv), CC2(6, line_amv, line_amv))
    !$acc data copyin(data1, data2, data3,lon,lat) copyout(CC1(:,:,:),CC2(:,:,:))
    !$acc parallel loop gang vector collapse(2) private(sub1(:,:), sub2(:,:), sub3(:,:), istart2, jstart2, max_cc1, max_cc2, tx1, ty1, tx2, ty2,correlation2,jstart1,istart1,correlation1,txx1,tyy1,txx2,tyy2) firstprivate(sbox_size,bbox_size)
    do i = 1, line_amv
        do j = 1, line_amv
            max_cc1 = -999
            max_cc2 = -999
            istart2 = (i-1) * sbox_size/2 + 1
            jstart2 = (j-1) * sbox_size/2 + 1
            !sub2 = data2(istart2:istart2+sbox_size-1, jstart2:jstart2+sbox_size-1)
            DO ii = 1, sbox_size
                DO jj = 1, sbox_size
                    sub2(ii, jj) = data2(istart2 + ii - 1, jstart2 + jj - 1)
                END DO
            END DO
            do x = 1, bbox_size - sbox_size + 1
                do y = 1, bbox_size - sbox_size + 1
                    istart1 = max(1, istart2 - (bbox_size-sbox_size)/2 + x - 1)
                    jstart1 = max(1, jstart2 - (bbox_size-sbox_size)/2 + y - 1)
                    if (istart1 + sbox_size - 1 <= line_l1b .and. jstart1 + sbox_size - 1 <= line_l1b) then
                        !sub1 = data1(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
                        !sub3 = data3(istart1:istart1+sbox_size-1, jstart1:jstart1+sbox_size-1)
                        DO xx = 1, sbox_size
                            DO yy = 1, sbox_size
                                sub1(xx, yy) = data1(istart1 + xx - 1, jstart1 + yy - 1)
                            END DO
                        END DO
                        DO xx = 1, sbox_size
                            DO yy = 1, sbox_size
                                sub3(xx, yy) = data3(istart1 + xx - 1, jstart1 + yy - 1)
                            END DO
                        END DO
! begin inline
!                        correlation1 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub1)
        E_t = SUM(sub2) / (sbox_size * sbox_size)
        E_s = SUM(sub1) / (sbox_size * sbox_size)
        E_ts = SUM((sub2 - E_t) * (sub1 - E_s)) / (sbox_size * sbox_size)
        sita_t = SQRT(SUM((sub2 - E_t)**2) / (sbox_size * sbox_size))
        sita_s = SQRT(SUM((sub1 - E_s)**2) / (sbox_size * sbox_size))
        IF (sita_t == 0.0 .OR. sita_s == 0.0) THEN
            correlation1 = 0.0
        ELSE
            correlation1 = E_ts / (sita_t * sita_s)
        END IF
        if(correlation1 .gt. 1) then
            correlation1 = 0
        end if
! end inline

! begin inline
!                        correlation2 = get_matrix_correlation_coef(sbox_size, sbox_size, sub2, sub3)
        E_t = SUM(sub2) / (sbox_size * sbox_size)
        E_s = SUM(sub3) / (sbox_size * sbox_size)
        E_ts = SUM((sub2 - E_t) * (sub3 - E_s)) / (sbox_size * sbox_size)
        sita_t = SQRT(SUM((sub2 - E_t)**2) / (sbox_size * sbox_size))
        sita_s = SQRT(SUM((sub3 - E_s)**2) / (sbox_size * sbox_size))
        IF (sita_t == 0.0 .OR. sita_s == 0.0) THEN
            correlation2 = 0.0
        ELSE
            correlation2 = E_ts / (sita_t * sita_s)
        END IF
        if(correlation2 .gt. 1) then
            correlation2 = 0
        end if
! end inline
                        if (max_cc1 < correlation1) then
                            max_cc1 = correlation1
                            txx1 = x
                            tyy1 = y
                        end if
                        if (max_cc2 < correlation2) then
                            max_cc2 = correlation2
                            txx2 = x
                            tyy2 = y
                        end if

                    end if
                end do
            end do

        end do
    end do
    !$acc end parallel
    !$acc end data
    DEALLOCATE(data1, data2, data3, lon, lat)
    DEALLOCATE(sub1, sub2, sub3)
    DEALLOCATE(CC1, CC2)
end program CC
1 Like

Thank you very much for your help. Indeed, the issue was caused by the passing of private arrays. Following your suggestions, I have resolved the problem. Thanks again!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.