Parallel (async) execution of an OpenACC loop on multiple GPUs stops working when a nested seq loop is added (Fortran)

Hello, everyone.

I am trying to develop a code that runs on multiple GPUs simultaneously. The end goal is to have different GPUs execute different chunks of one global outer loop, with all data copied to each GPU. I am using the “async” clause of the OpenACC “parallel loop” directive. With just 2 nested collapsed loops, the code works fine, and all 4 GPUs in my system appear to run simultaneously.
However, if I add an inner loop (seq, in my case) inside the collapsed loops, the code stops working as intended and executes sequentially, GPU after GPU. It does not matter what is inside that inner loop; its mere presence breaks the simultaneous execution.
Below is the reproducible example:

    program reproducible_example
        use cudafor
        use openacc        
        implicit none
        integer                          :: ngpus, gp1, gp2, ia, ia2
        integer, value                   :: devicenum
        integer(acc_device_kind), value  :: devicetype
        integer(acc_device_property)     :: property
        character*(1000)                 :: string
        integer,allocatable              :: count_arr(:), dummy_add_arr(:)
        real                             :: time_dat1, time_dat2


        write(*,*) '-------- Start Multi-GPU test --------'
        devicetype = acc_get_device_type()
        ngpus = acc_get_num_devices(devicetype)

        if (allocated(count_arr))     deallocate(count_arr)
        if (allocated(dummy_add_arr)) deallocate(dummy_add_arr)
        if (.not.allocated(count_arr))     allocate(count_arr(ngpus))
        if (.not.allocated(dummy_add_arr)) allocate(dummy_add_arr(ngpus))
        call acc_get_property_string(devicenum,devicetype,property,string)
        write(*,*) 'Device type:',trim(string)
        count_arr(:) = 0_4
        dummy_add_arr(:) = 999_4
        write(*,*) 'Values of dummy_add_arr: ',dummy_add_arr
        do devicenum=0,ngpus - 1
            call acc_set_device_num(devicenum,acc_device_nvidia)
            !$acc enter data copyin(count_arr,dummy_add_arr)
            write(*,*) ' Data was copied to GPU ',devicenum+1
        enddo
        call cpu_time(time_dat1)
        do devicenum=0,ngpus - 1 ! iterate the GPUs
            call acc_set_device_num(devicenum,acc_device_nvidia)
            write(*,'(A20,I4,A3,I4)') ' Current GPU ID is: ',devicenum+1,' / ',ngpus

            !$acc parallel loop seq collapse(2) reduction(+:dummy_add_arr)  & 
            !$acc& present(count_arr,dummy_add_arr) async
            do gp1 = 1,5000
                do gp2 = 1,100000
                    dummy_add_arr(devicenum+1) = dummy_add_arr(devicenum+1) + gp2*2 - gp1
                    ! Here I want to see when certain points of calculation are reached by each GPU
                    if(gp1.eq.1000.and.gp2.eq.60000) then
                        write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
                    endif
                    if(gp1.eq.2000.and.gp2.eq.2000) then
                        write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
                    endif
                    if(gp1.eq.3000.and.gp2.eq.97000) then
                        write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
                    endif
                    if(gp1.eq.4000.and.gp2.eq.3000) then
                        write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
                    endif
                    if(gp1.eq.5000.and.gp2.eq.55000) then
                        write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
                    endif

                    ! this line is triggering sequential execution on multi-GPU
                    !$acc loop seq
                    do ia = 1,100
                        
                    enddo
                enddo

            enddo
            write(*,'(A26,I4,A3,I4)') ' Finished running on GPU: ',devicenum+1,' / ',ngpus
        enddo
        !$acc wait

        write(*,*) 'Final result (CPU): ',dummy_add_arr
        do devicenum=0,ngpus - 1
            call acc_set_device_num(devicenum,acc_device_nvidia)
            !$acc update self(dummy_add_arr(devicenum+1))
            write(*,*) ' Data was copied from GPU ',devicenum+1
        enddo

        write(*,*) 'Final result (GPU): ',dummy_add_arr
        do devicenum=0,ngpus - 1
            call acc_set_device_num(devicenum,acc_device_nvidia)
            !$acc exit data delete(count_arr,dummy_add_arr)
            write(*,*) ' Data was removed from GPU ',devicenum+1
        enddo
        
        call cpu_time(time_dat2)
        write(*,'(A23,F11.5,A4)') ' Total execution time: ',time_dat2 - time_dat1, ' sec'
        write(*,*) '-------- Finish Multi-GPU test --------'

        if (allocated(count_arr))     deallocate(count_arr)
        if (allocated(dummy_add_arr)) deallocate(dummy_add_arr)

    end program reproducible_example

When the “do ia = 1,100” loop is removed, the code produces the following output (let’s call it Case 1):

[siarhei@pgpu02 multigpu_openacc]$ sh RE_compile_and_run.sh
reproducible_example:
     29, Generating enter data copyin(dummy_add_arr(:),count_arr(:))
     37, Generating present(count_arr(:),dummy_add_arr(:))
         Generating NVIDIA GPU code
         37, Generating reduction(+:dummy_add_arr(:))
         39, !$acc loop seq collapse(2)
         40,   collapsed
     74, Generating update self(dummy_add_arr(devicenum+1))
     81, Generating exit data delete(dummy_add_arr(:),count_arr(:))
 -------- Start Multi-GPU test --------
 Device type:

 Values of dummy_add_arr:           999          999          999          999
  Data was copied to GPU             1
  Data was copied to GPU             2
  Data was copied to GPU             3
  Data was copied to GPU             4
 Current GPU ID is:    1 /    4
 Finished running on GPU:    1 /    4
 Current GPU ID is:    2 /    4
 Finished running on GPU:    2 /    4
 Current GPU ID is:    3 /    4
 Finished running on GPU:    3 /    4
 Current GPU ID is:    4 /    4
 Finished running on GPU:    4 /    4
 GPU device:             3         1000        60000
 GPU device:             2         1000        60000
 GPU device:             1         1000        60000
 GPU device:             4         1000        60000
 GPU device:             3         2000         2000
 GPU device:             2         2000         2000
 GPU device:             1         2000         2000
 GPU device:             4         2000         2000
 GPU device:             3         3000        97000
 GPU device:             2         3000        97000
 GPU device:             1         3000        97000
 GPU device:             4         3000        97000
 GPU device:             3         4000         3000
 GPU device:             2         4000         3000
 GPU device:             1         4000         3000
 GPU device:             4         4000         3000
 GPU device:             3         5000        55000
 GPU device:             2         5000        55000
 GPU device:             1         5000        55000
 GPU device:             4         5000        55000
 Final result (CPU):           999          999          999          999
  Data was copied from GPU             1
  Data was copied from GPU             2
  Data was copied from GPU             3
  Data was copied from GPU             4
 Final result (GPU):   -1923775897  -1923775897  -1923775897  -1923775897
  Data was removed from GPU             1
  Data was removed from GPU             2
  Data was removed from GPU             3
  Data was removed from GPU             4
 Total execution time:    60.88019 sec
 -------- Finish Multi-GPU test --------

However, if the “do ia = 1,100” loop is kept, the output is the following (let’s call it Case 2):

[siarhei@pgpu02 multigpu_openacc]$ sh RE_compile_and_run.sh
reproducible_example:
     29, Generating enter data copyin(dummy_add_arr(:),count_arr(:))
     60, Accelerator restriction: unsupported statement type: opcode=ACCPLOOP
     74, Generating update self(dummy_add_arr(devicenum+1))
     81, Generating exit data delete(dummy_add_arr(:),count_arr(:))
 -------- Start Multi-GPU test --------
 Device type:

 Values of dummy_add_arr:           999          999          999          999
  Data was copied to GPU             1
  Data was copied to GPU             2
  Data was copied to GPU             3
  Data was copied to GPU             4
 Current GPU ID is:    1 /    4
 GPU device:             1         1000        60000
 GPU device:             1         2000         2000
 GPU device:             1         3000        97000
 GPU device:             1         4000         3000
 GPU device:             1         5000        55000
 Finished running on GPU:    1 /    4
 Current GPU ID is:    2 /    4
 GPU device:             2         1000        60000
 GPU device:             2         2000         2000
 GPU device:             2         3000        97000
 GPU device:             2         4000         3000
 GPU device:             2         5000        55000
 Finished running on GPU:    2 /    4
 Current GPU ID is:    3 /    4
 GPU device:             3         1000        60000
 GPU device:             3         2000         2000
 GPU device:             3         3000        97000
 GPU device:             3         4000         3000
 GPU device:             3         5000        55000
 Finished running on GPU:    3 /    4
 Current GPU ID is:    4 /    4
 GPU device:             4         1000        60000
 GPU device:             4         2000         2000
 GPU device:             4         3000        97000
 GPU device:             4         4000         3000
 GPU device:             4         5000        55000
 Finished running on GPU:    4 /    4
 Final result (CPU):   -1923775897  -1923775897  -1923775897  -1923775897
  Data was copied from GPU             1
  Data was copied from GPU             2
  Data was copied from GPU             3
  Data was copied from GPU             4
 Final result (GPU):           999          999          999          999
  Data was removed from GPU             1
  Data was removed from GPU             2
  Data was removed from GPU             3
  Data was removed from GPU             4
 Total execution time:     4.30303 sec
 -------- Finish Multi-GPU test --------

The code is compiled and executed using the following SH file:

nvfortran -fast -cuda -mp -acc -gpu=cc86,deepcopy,cuda11.7 -Mlarge_arrays -cpp -Minfo=accel -Mbackslash -o=run_reproducible_ex reproducible_example.f90 #
./run_reproducible_ex

The compiler is NVFORTRAN 22.9. The OS is CentOS 7 3.10.0-957.el7.x86_64. The CPU is AMD EPYC 7543 (dual socket). The GPUs are NVIDIA RTX A5000 (4 cards). nvidia-smi shows the following information: NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7.

I have a few questions given the shown results:

  1. Is there a way to make my code work on multiple GPUs with the “do ia = 1,100” loop (Case 2)? The loop should stay sequential.
  2. Why is there such a drastic reduction in execution time when the “do ia = 1,100” loop is added?
  3. Why does the value of the “dummy_add_arr” change before being updated from the GPU in Case 2 and does not in Case 1?
  4. What is the meaning of compiler message “60, Accelerator restriction: unsupported statement type: opcode=ACCPLOOP” in Case 2?

The way I judge whether the code executes sequentially on the GPUs is the order of the output. With parallel execution, all GPUs reach each checkpoint at roughly the same time and print together (Case 1). With sequential execution, each GPU runs through the entire list of checkpoints and prints them before the next GPU starts (Case 2).

I would appreciate any helpful feedback on what is going on with the code here.

Hi s.dzianisau,

Notice the following error in your compiler feedback:

    60, Accelerator restriction: unsupported statement type: opcode=ACCPLOOP

Since the inner loop is empty, the compiler removes it. There is then no loop left to apply the “acc loop seq” directive to, so GPU compilation fails and the host fallback is used.

  1. Is there a way to make my code work on multiple GPUs with the “do ia = 1,100” loop (Case 2)? The loop should stay sequential.

Add some code so the loop isn’t removed.
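For example, a minimal sketch of one way to keep the loop from being eliminated is to give it a body with an observable effect (the statement below is only illustrative and changes the computed result; in real code the body would be your actual work):

    ! Illustrative sketch: the inner seq loop now has a body with an
    ! observable effect, so the compiler should not dead-code-eliminate it.
    !$acc loop seq
    do ia = 1,100
        dummy_add_arr(devicenum+1) = dummy_add_arr(devicenum+1) + ia
    enddo

Any statement the compiler cannot prove dead should prevent the loop's removal.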

  2. Why is there such a drastic reduction in execution time when the “do ia = 1,100” loop is added?

Because in Case 2 it’s running on the host. Also note that even in Case 1 the code is very slow on the GPU, since the “seq” clause on the “parallel loop” directive causes the loop to run sequentially on the device.
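As a side note, if the intent is for each device to execute its own chunk in parallel rather than sequentially, the “seq” clause would be dropped, along the lines of this untested sketch (note that without “seq” the iteration order is undefined, so the ordering of the checkpoint prints would no longer be meaningful):

    !$acc parallel loop collapse(2) reduction(+:dummy_add_arr) &
    !$acc& present(count_arr,dummy_add_arr) async
    do gp1 = 1,5000
        do gp2 = 1,100000
            dummy_add_arr(devicenum+1) = dummy_add_arr(devicenum+1) + gp2*2 - gp1
        enddo
    enddo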

  3. Why does the value of the “dummy_add_arr” change before being updated from the GPU in Case 2 and does not in Case 1?

No device code is being generated, so the loop runs on the host and modifies the host copy of “dummy_add_arr” directly. The subsequent “update self” then overwrites it with the untouched device copy (still 999).

  4. What is the meaning of compiler message “60, Accelerator restriction: unsupported statement type: opcode=ACCPLOOP” in Case 2?

It occurs when a “loop” directive is applied to a loop that gets removed by dead-code elimination, as happens with your empty “do ia = 1,100” loop.

In general, I recommend using MPI+OpenACC for multi-GPU programming. It’s typically more straightforward to program, since domain decomposition is inherent to MPI, it gives a one-to-one mapping of rank to GPU, and it extends the program to run across multiple nodes.
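A minimal sketch of the rank-to-GPU mapping under that approach (names illustrative, assumes one MPI rank per GPU):

    program mpi_acc_sketch
        use mpi
        use openacc
        implicit none
        integer :: rank, nranks, ngpus, ierr

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

        ! Map each rank to one device (round-robin if ranks > GPUs)
        ngpus = acc_get_num_devices(acc_device_nvidia)
        call acc_set_device_num(mod(rank, ngpus), acc_device_nvidia)

        ! ... each rank now works on its own chunk of the outer loop,
        ! with its own copy of the data on its device ...

        call MPI_Finalize(ierr)
    end program mpi_acc_sketch

Each rank then launches its kernels on its own device with no need for the explicit acc_set_device_num loop over devices.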

-Mat