Hello, everyone.
I am trying to develop code that runs on multiple GPUs simultaneously. The end goal is to have different GPUs execute different chunks of one global outer loop, with each GPU holding a full copy of the data. I am using the “async” clause of the OpenACC “parallel loop” directive. With just two nested collapsed loops, the code works fine, and all four GPUs in my system appear to run simultaneously.
However, if I add an inner loop (seq, in my case) inside the collapsed loops, the code stops working as intended and runs sequentially, GPU after GPU. It does not matter what is inside that loop; its mere presence breaks the simultaneous execution.
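In isolation, the launch pattern I am relying on looks like the following; this is only a minimal sketch (array sizes, names, and the loop body are placeholders, not my actual workload):

```fortran
! Minimal sketch of the intended dispatch pattern: one async kernel
! per device, then a per-device wait. Sizes/names are placeholders.
program async_dispatch_sketch
  use openacc
  implicit none
  integer, parameter :: n = 1000000
  integer :: dev, ndev, i
  real, allocatable :: chunk(:,:)

  ndev = acc_get_num_devices(acc_device_nvidia)
  allocate(chunk(n, ndev))
  chunk = 0.0

  ! Launch one asynchronous kernel per device; each device works on
  ! its own column, and control should return to the host immediately
  ! after each launch so the next device can start right away.
  do dev = 0, ndev - 1
    call acc_set_device_num(dev, acc_device_nvidia)
    !$acc parallel loop copy(chunk(:,dev+1)) async
    do i = 1, n
      chunk(i, dev+1) = sqrt(real(i)) + real(dev)
    end do
  end do

  ! Block until every device's default async queue has drained;
  ! "!$acc wait" only applies to the current device, hence the loop.
  do dev = 0, ndev - 1
    call acc_set_device_num(dev, acc_device_nvidia)
    !$acc wait
  end do

  write(*,*) chunk(1,1), chunk(n,ndev)
end program async_dispatch_sketch
```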
Below is a reproducible example:
program reproducible_example
  use cudafor
  use openacc
  implicit none
  integer :: ngpus, gp1, gp2, ia, ia2
  integer :: devicenum
  integer(acc_device_kind) :: devicetype
  integer(acc_device_property) :: property
  character(len=1000) :: string
  integer, allocatable :: count_arr(:), dummy_add_arr(:)
  real :: time_dat1, time_dat2

  write(*,*) '-------- Start Multi-GPU test --------'
  devicetype = acc_get_device_type()
  ngpus = acc_get_num_devices(devicetype)
  if (allocated(count_arr)) deallocate(count_arr)
  if (allocated(dummy_add_arr)) deallocate(dummy_add_arr)
  if (.not.allocated(count_arr)) allocate(count_arr(ngpus))
  if (.not.allocated(dummy_add_arr)) allocate(dummy_add_arr(ngpus))
  call acc_get_property_string(devicenum,devicetype,property,string)
  write(*,*) 'Device type:',trim(string)
  count_arr(:) = 0_4
  dummy_add_arr(:) = 999_4
  write(*,*) 'Values of dummy_add_arr: ',dummy_add_arr

  ! Copy all data to every GPU
  do devicenum = 0,ngpus-1
    call acc_set_device_num(devicenum,acc_device_nvidia)
    !$acc enter data copyin(count_arr,dummy_add_arr)
    write(*,*) ' Data was copied to GPU ',devicenum+1
  enddo

  call cpu_time(time_dat1)
  do devicenum = 0,ngpus-1 ! iterate over the GPUs
    call acc_set_device_num(devicenum,acc_device_nvidia)
    write(*,'(A20,I4,A3,I4)') ' Current GPU ID is: ',devicenum+1,' / ',ngpus
    !$acc parallel loop seq collapse(2) reduction(+:dummy_add_arr) &
    !$acc& present(count_arr,dummy_add_arr) async
    do gp1 = 1,5000
      do gp2 = 1,100000
        dummy_add_arr(devicenum+1) = dummy_add_arr(devicenum+1) + gp2*2 - gp1
        ! Here I want to see when certain points of the calculation are reached by each GPU
        if (gp1.eq.1000 .and. gp2.eq.60000) then
          write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
        endif
        if (gp1.eq.2000 .and. gp2.eq.2000) then
          write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
        endif
        if (gp1.eq.3000 .and. gp2.eq.97000) then
          write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
        endif
        if (gp1.eq.4000 .and. gp2.eq.3000) then
          write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
        endif
        if (gp1.eq.5000 .and. gp2.eq.55000) then
          write(*,*) 'GPU device: ',devicenum+1,gp1,gp2
        endif
        ! this loop triggers sequential execution on multi-GPU
        !$acc loop seq
        do ia = 1,100
        enddo
      enddo
    enddo
    write(*,'(A26,I4,A3,I4)') ' Finished running on GPU: ',devicenum+1,' / ',ngpus
  enddo
  !$acc wait
  write(*,*) 'Final result (CPU): ',dummy_add_arr

  do devicenum = 0,ngpus-1
    call acc_set_device_num(devicenum,acc_device_nvidia)
    !$acc update self(dummy_add_arr(devicenum+1))
    write(*,*) ' Data was copied from GPU ',devicenum+1
  enddo
  write(*,*) 'Final result (GPU): ',dummy_add_arr

  do devicenum = 0,ngpus-1
    call acc_set_device_num(devicenum,acc_device_nvidia)
    !$acc exit data delete(count_arr,dummy_add_arr)
    write(*,*) ' Data was removed from GPU ',devicenum+1
  enddo

  call cpu_time(time_dat2)
  write(*,'(A23,F11.5,A4)') ' Total execution time: ',time_dat2 - time_dat1,' sec'
  write(*,*) '-------- Finish Multi-GPU test --------'
  if (allocated(count_arr)) deallocate(count_arr)
  if (allocated(dummy_add_arr)) deallocate(dummy_add_arr)
end program reproducible_example
When the “do ia = 1,100” loop is removed, the code produces the following output (let’s call it Case 1):
[siarhei@pgpu02 multigpu_openacc]$ sh RE_compile_and_run.sh
reproducible_example:
29, Generating enter data copyin(dummy_add_arr(:),count_arr(:))
37, Generating present(count_arr(:),dummy_add_arr(:))
Generating NVIDIA GPU code
37, Generating reduction(+:dummy_add_arr(:))
39, !$acc loop seq collapse(2)
40, collapsed
74, Generating update self(dummy_add_arr(devicenum+1))
81, Generating exit data delete(dummy_add_arr(:),count_arr(:))
-------- Start Multi-GPU test --------
Device type:
Values of dummy_add_arr: 999 999 999 999
Data was copied to GPU 1
Data was copied to GPU 2
Data was copied to GPU 3
Data was copied to GPU 4
Current GPU ID is: 1 / 4
Finished running on GPU: 1 / 4
Current GPU ID is: 2 / 4
Finished running on GPU: 2 / 4
Current GPU ID is: 3 / 4
Finished running on GPU: 3 / 4
Current GPU ID is: 4 / 4
Finished running on GPU: 4 / 4
GPU device: 3 1000 60000
GPU device: 2 1000 60000
GPU device: 1 1000 60000
GPU device: 4 1000 60000
GPU device: 3 2000 2000
GPU device: 2 2000 2000
GPU device: 1 2000 2000
GPU device: 4 2000 2000
GPU device: 3 3000 97000
GPU device: 2 3000 97000
GPU device: 1 3000 97000
GPU device: 4 3000 97000
GPU device: 3 4000 3000
GPU device: 2 4000 3000
GPU device: 1 4000 3000
GPU device: 4 4000 3000
GPU device: 3 5000 55000
GPU device: 2 5000 55000
GPU device: 1 5000 55000
GPU device: 4 5000 55000
Final result (CPU): 999 999 999 999
Data was copied from GPU 1
Data was copied from GPU 2
Data was copied from GPU 3
Data was copied from GPU 4
Final result (GPU): -1923775897 -1923775897 -1923775897 -1923775897
Data was removed from GPU 1
Data was removed from GPU 2
Data was removed from GPU 3
Data was removed from GPU 4
Total execution time: 60.88019 sec
-------- Finish Multi-GPU test --------
However, if the “do ia = 1,100” loop is kept, the output is the following (let’s call it Case 2):
[siarhei@pgpu02 multigpu_openacc]$ sh RE_compile_and_run.sh
reproducible_example:
29, Generating enter data copyin(dummy_add_arr(:),count_arr(:))
60, Accelerator restriction: unsupported statement type: opcode=ACCPLOOP
74, Generating update self(dummy_add_arr(devicenum+1))
81, Generating exit data delete(dummy_add_arr(:),count_arr(:))
-------- Start Multi-GPU test --------
Device type:
Values of dummy_add_arr: 999 999 999 999
Data was copied to GPU 1
Data was copied to GPU 2
Data was copied to GPU 3
Data was copied to GPU 4
Current GPU ID is: 1 / 4
GPU device: 1 1000 60000
GPU device: 1 2000 2000
GPU device: 1 3000 97000
GPU device: 1 4000 3000
GPU device: 1 5000 55000
Finished running on GPU: 1 / 4
Current GPU ID is: 2 / 4
GPU device: 2 1000 60000
GPU device: 2 2000 2000
GPU device: 2 3000 97000
GPU device: 2 4000 3000
GPU device: 2 5000 55000
Finished running on GPU: 2 / 4
Current GPU ID is: 3 / 4
GPU device: 3 1000 60000
GPU device: 3 2000 2000
GPU device: 3 3000 97000
GPU device: 3 4000 3000
GPU device: 3 5000 55000
Finished running on GPU: 3 / 4
Current GPU ID is: 4 / 4
GPU device: 4 1000 60000
GPU device: 4 2000 2000
GPU device: 4 3000 97000
GPU device: 4 4000 3000
GPU device: 4 5000 55000
Finished running on GPU: 4 / 4
Final result (CPU): -1923775897 -1923775897 -1923775897 -1923775897
Data was copied from GPU 1
Data was copied from GPU 2
Data was copied from GPU 3
Data was copied from GPU 4
Final result (GPU): 999 999 999 999
Data was removed from GPU 1
Data was removed from GPU 2
Data was removed from GPU 3
Data was removed from GPU 4
Total execution time: 4.30303 sec
-------- Finish Multi-GPU test --------
The code is compiled and executed with the following shell script:
nvfortran -fast -cuda -mp -acc -gpu=cc86,deepcopy,cuda11.7 -Mlarge_arrays -cpp -Minfo=accel -Mbackslash -o=run_reproducible_ex reproducible_example.f90 #
./run_reproducible_ex
The compiler is NVFORTRAN 22.9, the OS is CentOS 7 (kernel 3.10.0-957.el7.x86_64), the CPU is a dual-socket AMD EPYC 7543, and the GPUs are four NVIDIA RTX A5000 cards. nvidia-smi reports: NVIDIA-SMI 515.48.07, Driver Version 515.48.07, CUDA Version 11.7.
Given these results, I have a few questions:
- Is there a way to make the code run on multiple GPUs simultaneously with the “do ia = 1,100” loop present (Case 2)? The loop has to stay sequential.
- Why does adding the “do ia = 1,100” loop reduce the execution time so drastically?
- Why does the value of “dummy_add_arr” change before being updated from the GPU in Case 2, but not in Case 1?
- What does the compiler message “60, Accelerator restriction: unsupported statement type: opcode=ACCPLOOP” mean in Case 2?
The way I judge whether the code executes sequentially on the GPUs is the order of the output. With parallel execution, all GPUs reach each checkpoint at roughly the same time and their prints interleave (Case 1). With a sequential run, each GPU prints its entire list of checkpoints before the next GPU starts running (Case 2).
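Besides the print order, I suppose the launch loop itself could also be timed with system_clock; the following is only a sketch (sizes and names are placeholders), based on the assumption that a truly asynchronous launch returns to the host almost immediately, so nearly all of the wall time should land in the wait loop:

```fortran
! Sketch of a launch-time check; sizes/names are placeholders.
program launch_timing_sketch
  use openacc
  implicit none
  integer, parameter :: n = 50000000
  integer :: dev, ndev, i
  integer(8) :: c0, c1, c2, rate
  real, allocatable :: buf(:,:)

  ndev = acc_get_num_devices(acc_device_nvidia)
  allocate(buf(n, ndev))
  buf = 0.0

  call system_clock(c0, rate)
  do dev = 0, ndev - 1
    call acc_set_device_num(dev, acc_device_nvidia)
    !$acc parallel loop copy(buf(:,dev+1)) async
    do i = 1, n
      buf(i, dev+1) = buf(i, dev+1) + sqrt(real(i))
    end do
  end do
  call system_clock(c1)   ! should be close to c0 if launches do not block

  do dev = 0, ndev - 1
    call acc_set_device_num(dev, acc_device_nvidia)
    !$acc wait
  end do
  call system_clock(c2)   ! accounts for the actual GPU work

  write(*,*) 'launch loop: ', real(c1 - c0)/real(rate), ' s'
  write(*,*) 'wait loop:   ', real(c2 - c1)/real(rate), ' s'
end program launch_timing_sketch
```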
I would appreciate any helpful feedback on what is going on with the code here.