Question about OpenACC autocompare with MPI

Hi Mat, I have some questions about the autocompare flag when using MPI functions.

My environment:
nvhpc/22.11
Makefile flags:
ACC = -fast -acc -Minfo=accel -ta=tesla:autocompare
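(Note that recent NVHPC releases deprecate the `-ta=tesla` spelling in favor of `-gpu`; a likely-equivalent flag line would be the sketch below. The exact suboption spelling should be checked against the compiler release notes.)

```make
# Likely-equivalent flags with the newer -gpu spelling (assumption:
# autocompare is accepted as a -gpu suboption on nvhpc 22.11)
ACC = -fast -acc -Minfo=accel -gpu=autocompare
```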

I know how to use the autocompare feature,
but when I use it together with MPI functions it seems to malfunction.

This is my code.

    subroutine mpi_subdomain_ghostcell_update(Value_sub)

        use mpi_topology!,   only : comm_1d_x1, comm_1d_x2, comm_1d_x3

        implicit none

        double precision, dimension(0:n1sub, 0:n2sub, 0:n3sub), intent(inout) :: Value_sub

        integer :: ierr
        integer :: request(4),rr(2)

        ! Update the ghostcells in the x-direction using derived datatypes and subcommunicator.
        !$acc host_data use_device(Value_sub)
        call MPI_Isend(Value_sub,1, ddtype_sendto_E  , comm_1d_x1%east_rank, 111, comm_1d_x1%mpi_comm, request(1), ierr)
        call MPI_Irecv(Value_sub,1, ddtype_recvfrom_W, comm_1d_x1%west_rank, 111, comm_1d_x1%mpi_comm, request(2), ierr)
        call MPI_Isend(Value_sub,1, ddtype_sendto_W  , comm_1d_x1%west_rank, 222, comm_1d_x1%mpi_comm, request(3), ierr)
        call MPI_Irecv(Value_sub,1, ddtype_recvfrom_E, comm_1d_x1%east_rank, 222, comm_1d_x1%mpi_comm, request(4), ierr)
        call MPI_Waitall(4, request, MPI_STATUSES_IGNORE, ierr)
        ! Update the ghostcells in the y-direction using derived datatypes and subcommunicator.
        call MPI_Isend(Value_sub,1, ddtype_sendto_N  , comm_1d_x2%east_rank, 111, comm_1d_x2%mpi_comm, request(1), ierr)
        call MPI_Irecv(Value_sub,1, ddtype_recvfrom_S, comm_1d_x2%west_rank, 111, comm_1d_x2%mpi_comm, request(2), ierr)
        call MPI_Isend(Value_sub,1, ddtype_sendto_S  , comm_1d_x2%west_rank, 222, comm_1d_x2%mpi_comm, request(3), ierr)
        call MPI_Irecv(Value_sub,1, ddtype_recvfrom_N, comm_1d_x2%east_rank, 222, comm_1d_x2%mpi_comm, request(4), ierr)
        call MPI_Waitall(4, request, MPI_STATUSES_IGNORE, ierr)
        
        ! Update the ghostcells in the z-direction using derived datatypes and subcommunicator.
        call MPI_Isend(Value_sub,1, ddtype_sendto_F  , comm_1d_x3%east_rank, 111, comm_1d_x3%mpi_comm, request(1), ierr)
        call MPI_Irecv(Value_sub,1, ddtype_recvfrom_B, comm_1d_x3%west_rank, 111, comm_1d_x3%mpi_comm, request(2), ierr)
        call MPI_Isend(Value_sub,1, ddtype_sendto_B  , comm_1d_x3%west_rank, 222, comm_1d_x3%mpi_comm, request(3), ierr)
        call MPI_Irecv(Value_sub,1, ddtype_recvfrom_F, comm_1d_x3%east_rank, 222, comm_1d_x3%mpi_comm, request(4), ierr)
        call MPI_Waitall(4, request, MPI_STATUSES_IGNORE, ierr)
        !$acc end host_data
        
    end subroutine mpi_subdomain_ghostcell_update
..................
calculate the 'U' array (a 3-dimensional array)
..................

line 779 !$acc compare(U)

call mpi_subdomain_ghostcell_update(U)

line 781 !$acc compare(U) 
line 784 !$acc update host(U) 
line 786 !$acc compare(U) 

This is the terminal output:

PCAST Double u in function mod_flowarray__makefld, /home/jsera.lee/apply_openACC/src/mod_flowarray.f90:781
        idx: 258 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 259 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 260 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 261 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 262 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 263 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 264 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 265 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 266 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 267 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
PCAST Double u in function mod_flowarray__makefld, /home/jsera.lee/apply_openACC/src/mod_flowarray.f90:784
        idx: 258 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 259 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 260 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 261 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 262 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 263 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 264 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 265 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 266 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 267 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
PCAST Double u in function mod_flowarray__makefld, /home/jsera.lee/apply_openACC/src/mod_flowarray.f90:786
        idx: 258 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 259 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 260 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 261 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 262 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 263 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 264 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 265 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 266 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01
        idx: 267 FAIL ABS  act: 7.03125000000000000e-01 exp: 0.00000000000000000e+00 dif: 7.03125000000000000e-01

At first I was really confused by this output, but then I found a clue:

If you have some computation or data movement outside of the control of OpenACC that affects only host or only device memory, you have to account for that when using PCAST. For instance, an MPI data transfer to host memory would have to be copied to the device. If you use CUDA-aware MPI and do an MPI data transfer directly between two GPUs, then the host memory values are stale and must be updated. Similarly, if you have any cuBLAS or cuSolver calls on the device, then the host memory values are stale. This is likely where you have the OpenACC host_data construct, which says that something in device memory is being processed.

https://developer.nvidia.com/blog/detecting-divergence-using-pcast-to-compare-gpu-to-cpu-results/

As you can see, I used "!$acc host_data use_device" for CUDA-aware MPI / GPUDirect.
All the arrays are located in GPU buffers, so GPU-to-GPU communication is more efficient.
But I don't know whether it works properly. (This is not the main question of this topic.)

Q. This is my question.
line 779 !$acc compare(U) passed → the CPU and GPU buffers hold the same data.
The MPI function uses host_data use_device.
line 781 !$acc compare(U) detected differences → I understand this, because the communication ran on the GPU.
line 784 !$acc update host(U) detected differences → I understand this too; the CPU and GPU data differ.
But as I understand it, update host with autocompare does: 1. memcpy device to host, 2. compare to detect differences, 3. update the host data with the device data (this is my opinion).
So line 786 !$acc compare(U) detecting differences → this I can't understand. Line 784 already updated the host data, so I think it should not be possible.

Could I ask what is wrong?

Hi leejsera,

I don't use the compare feature myself, so I am not an expert here.

I'm not sure what your PCAST_COMPARE setting is, but the default is to print up to 50 differences. Here, 10 are getting printed starting at index 258, meaning only part of the array shows differences. What I'm wondering is whether, given only the halos are getting updated, this is causing some data to become stale.
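For reference, the number of reported differences and the comparison tolerance can be set through the PCAST_COMPARE environment variable; a sketch, with option spellings as I recall them from the PCAST documentation (verify against your compiler version):

```shell
# Print up to 100 differences per comparison (default 50) and ignore
# absolute differences below 1e-8; option names per the NVHPC PCAST
# docs, so double-check them for nvhpc 22.11
export PCAST_COMPARE=report=100,abs=8
```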

Maybe try the following to ensure the data is in sync:

!$acc update device(U)
call mpi_subdomain_ghostcell_update(U)
!$acc update self(U) 
!$acc compare(U)

buffer so gpu to gpu communications are more efficient. but I don't know whether it works properly

Can you elaborate? Depending on the problem, using compare may or may not be the best way to determine if things are working as expected.

-Mat

There are many things I can't understand…

When I use "!$acc compare all" in my code,
it does not compile.

NVFORTRAN-S-0034-Syntax error at or near identifier all (mod_passivescalar.f90: 673)
0 inform, 0 warnings, 1 severes, 0 fatal for rhs_ibm_ps
make: *** [Makefile:39: obj/mod_passivescalar.o] Error 2

Q1.
I found the directive "!$acc compare all" in the references below.
Is the reference information wrong, or is something wrong with my settings?

Q2.
When I use "call acc_compare_all()",
the code compiles, but the report is printed without names:
the variable name and function name come out as (null), like below,
so it is useless…

PCAST Double (null) in function (null), (null):0
        idx: 256 FAIL ABS  act: 0.00000000000000000e+00 exp: -nan dif: nan
PCAST Long (null) in function (null), (null):0
        idx: 0 FAIL act: 0 exp: 75424
PCAST Long (null) in function (null), (null):0
        idx: 0 FAIL act: 0 exp: 192864
PCAST Double (null) in function (null), (null):0
        idx: 0 FAIL ABS  act: 0.00000000000000000e+00 exp: 3.30000000000000000e+01 dif: 3.30000000000000000e+01
        idx: 2 FAIL ABS  act: 0.00000000000000000e+00 exp: 3.40000000000000000e+01 dif: 3.40000000000000000e+01
        idx: 4 FAIL ABS  act: 0.00000000000000000e+00 exp: 3.50000000000000000e+01 dif: 3.50000000000000000e+01
        idx: 6 FAIL ABS  act: 0.00000000000000000e+00 exp: 3.60000000000000000e+01 dif: 3.60000000000000000e+01
PCAST Double (null) in function (null), (null):0
        idx: 0 FAIL ABS  act: 0.00000000000000000e+00 exp: -1.16077616512673330e+03 dif: 1.16077616512673330e+03
        idx: 1 FAIL ABS  act: 0.00000000000000000e+00 exp: -2.32126214528074888e+03 dif: 2.32126214528074888e+03
        idx: 2 FAIL ABS  act: 0.00000000000000000e+00 exp: -2.32126214528074888e+03 dif: 2.32126214528074888e+03
        idx: 3 FAIL ABS  act: 0.00000000000000000e+00 exp: -2.32126214528074888e+03 dif: 2.32126214528074888e+03
        idx: 4 FAIL ABS  act: 0.00000000000000000e+00 exp: -2.32126214528074888e+03 dif: 2.32126214528074888e+03
        idx: 5 FAIL ABS  act: 0.00000000000000000e+00 exp: -2.32126214528074888e+03 dif: 2.32126214528074888e+03
        idx: 6 FAIL ABS  act: 0.00000000000000000e+00 exp: -2.32126214528074888e+03 dif: 2.32126214528074888e+03
        idx: 7 FAIL ABS  act: 0.00000000000000000e+00 exp: -2.32126214528074888e+03 dif: 2.32126214528074888e+03

Meanwhile, when I use "!$acc compare(dpsdtr_sub)",
it works!

PCAST Double dpsdtr_sub in function rhs_ibm_ps, /home/jsera.lee/apply_openACC/src/mod_passivescalar.f90:728
        idx: 0 FAIL ABS  act: 0.00000000000000000e+00 exp: -6.48899480404105270e+03 dif: 6.48899480404105270e+03
        idx: 1 FAIL ABS  act: 0.00000000000000000e+00 exp: -3.24449740202052726e+03 dif: 3.24449740202052726e+03
        idx: 2 FAIL ABS  act: 0.00000000000000000e+00 exp: -3.24449740202052726e+03 dif: 3.24449740202052726e+03
        idx: 3 FAIL ABS  act: 0.00000000000000000e+00 exp: -3.24449740202052726e+03 dif: 3.24449740202052726e+03
        idx: 4 FAIL ABS  act: 0.00000000000000000e+00 exp: -3.24449740202052726e+03 dif: 3.24449740202052726e+03
        idx: 5 FAIL ABS  act: 0.00000000000000000e+00 exp: -3.24449740202052726e+03 dif: 3.24449740202052726e+03
        idx: 6 FAIL ABS  act: 0.00000000000000000e+00 exp: -3.24449740202052726e+03 dif: 3.24449740202052726e+03
        idx: 7 FAIL ABS  act: 0.00000000000000000e+00 exp: -3.24449740202052726e+03 dif: 3.24449740202052726e+03

Q3.
Please tell me whom I should ask about PCAST and autocompare.

Your suggestion doesn't work.

!$acc update device(U)
call mpi_subdomain_ghostcell_update(U)
!$acc update self(U)
!$acc compare(U)

Q1.
Before the MPI function, OpenACC computes the "U" array, so the new data exists only in the GPU buffer "U"; the CPU buffer "U" holds data, but it is older values.
So I can't understand why the first line of your code is "!$acc update device".
I think if I want to sync CPU/GPU, I have to use "!$acc update host".

Q2.
In the MPI functions I use "!$acc host_data use_device(Value_sub)",
so as I understand it, the communication goes GPU buffer to GPU buffer,
and the GPU buffer's data is the new data.
After the MPI function, I have to "!$acc update host", and then the CPU/GPU are in sync.
Is there anything wrong with my reasoning?


In detail:

acc compute region for 'U'
MPI communication for the ghost cells of 'U'
acc compute region for 'U'

('U' already exists in the GPU buffer.)
So I think GPU ↔ GPU MPI communication will be more efficient,
like with CUDA-aware MPI.

In some examples online, they use the process below.
I think these two memory copies are unnecessary.

acc compute region about ā€˜Uā€™
acc update host
MPI communication for ghost cell ā€˜Uā€™
acc update device
acc compute region about ā€˜Uā€™

!$acc update host(sigmazz,sigmayz,sigmaxz)
! sigmazz(k+1), left shift
  call MPI_SENDRECV(sigmazz(:,:,1),number_of_values,MPI_DOUBLE_PRECISION, &
         receiver_left_shift,message_tag,sigmazz(:,:,NZ_LOCAL+1), &
         number_of_values,

...

!$acc update device(sigmazz,sigmayz,sigmaxz)

I think the references are wrong, or for some reason we didn't implement "acc compare all". Though you can use "call acc_compare_all()".

Please tell me whom I should ask about PCAST and autocompare.

I'll need to create a reproducing example and ask engineering for input, though it won't be until Tuesday, when our office reopens after a U.S. holiday.

Before the MPI function, OpenACC computes the "U" array, so the new data exists only in the GPU buffer "U"; the CPU buffer "U" holds data, but it is older values.
So I can't understand why the first line of your code is "!$acc update device".
I think if I want to sync CPU/GPU, I have to use "!$acc update host".

The only reason to add the updates is for the compare functions to work, since the MPI calls are outside of the autocompare feature's ability to track. They aren't needed nor desired without autocompare, since the data copies would be extraneous.

I'll ask again: what is the core issue that you're trying to solve? Depending on the problem, autocompare may or may not be the best tool to help solve it, and spending time working through these issues may not be beneficial.


Q. I wonder: is this a problem with my settings, or is this also not implemented?
When using call acc_compare_all(), it printed (null).

The only reason to add the updates is for the compare functions to work, since the MPI calls are outside of the autocompare feature's ability to track. They aren't needed nor desired without autocompare, since the data copies would be extraneous.

I understand your explanation.

I'll ask again: what is the core issue that you're trying to solve? Depending on the problem, autocompare may or may not be the best tool to help solve it, and spending time working through these issues may not be beneficial.

The problem I want to solve is comparative verification (the baseline is the CPU results).
I applied OpenACC parallelization to a large project,
and I was using PCAST autocompare to compare the results against the CPU results.
I was able to find and fix many of my mistakes using autocompare.
Autocompare seems to be a very useful and powerful verification tool!

When I first asked, I was trying GPU memory ↔ GPU memory MPI communication, using !$acc host_data use_device().

However, because autocompare cannot notice GPU-side MPI communication, I am now taking two steps:

  1. Verify the OpenACC calculation results using autocompare.
  2. Change the CPU-MPI communication to GPU-MPI communication.

My current code status:

!$acc compare(U)
!$acc update host(U)

call mpi_subdomain_ghostcell_update(U)

!$acc update device(U) 
!$acc compare(U)

and "!$acc host_data use_device()" is no longer used in the mpi_subdomain_ghostcell_update subroutine.

It is inefficient because of the two device ↔ host memory copies, but verification comes first!
I will verify the calculation results first, and afterwards apply CUDA-aware MPI or GPU RDMA.

I am always grateful for the fast, accurate, and friendly answers.
Thanks, Mat.

Sounds like the best strategy. I'm not 100% sure PCAST is supposed to work with host_data, but the PCAST article implies that it does, provided you add the extra updates so the memories are in sync. Michael, who led this effort, has since retired, so I can't ask him, and Adam, the co-author, hasn't gotten back to me yet.

It is inefficient because of the two device ↔ host memory copies, but verification comes first!
I will verify the calculation results first, and afterwards apply CUDA-aware MPI or GPU RDMA.

Autocompare itself is very slow, given that the loops get run on both the host and the device. I'd only use it for verification and disable it for production or performance testing.
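One way to keep it out of production builds is a Makefile switch around the flags used in this thread; a sketch (the PCAST variable name here is just an example):

```make
# Build with `make PCAST=1` for verification runs; a plain `make`
# drops autocompare so production builds run at full speed
ACC = -fast -acc -Minfo=accel
ifeq ($(PCAST),1)
  ACC += -ta=tesla:autocompare
endif
```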

-Mat

