Async Issue: Dual GPU Parallel Execution Runs Sequentially

Hello everyone, I encountered a problem where multiple GPUs fail to run in parallel. After using async, the two tasks first run on GPU 0 and then on GPU 1, instead of running on both GPUs simultaneously as I expected. The code is as follows:

program multi_gpu_async_example
implicit none
integer, parameter :: n = 30000
integer, parameter :: m = 20000
real, dimension(n, m) :: input_array, output_array_device0, output_array_device1
integer :: i, j, device_num
real :: start_time, end_time, total_time
real :: device0_start, device0_end, device1_start, device1_end

input_array = 1.0
output_array_device0 = 0.0
output_array_device1 = 0.0

call cpu_time(start_time)

device_num = 0
!$acc set device_num(device_num)
call cpu_time(device0_start)


!$acc data copyin(input_array) copyout(output_array_device0) async(1)
!$acc parallel loop gang vector collapse(2) private(i, j) async(1)
do i = 1, n
   do j = 1, m
      output_array_device0(i, j) = input_array(i, j) * 2.0
   end do
end do
!$acc end parallel
!$acc wait(1)
!$acc end data

call cpu_time(device0_end)

device_num = 1
!$acc set device_num(device_num)
call cpu_time(device1_start)

!$acc data copyin(input_array) copyout(output_array_device1) async(2)
!$acc parallel loop gang vector collapse(2) private(i, j) async(2)
do i = 1, n
   do j = 1, m
      output_array_device1(i, j) = input_array(i, j) * 3.0  ! computation for device 1
   end do
end do
!$acc end parallel
!$acc wait(2)
!$acc end data
call cpu_time(device1_end)


call cpu_time(end_time)
total_time = end_time - start_time

print *, "Total execution time (seconds): ", total_time
print *, "Device 0 execution time (seconds): ", device0_end - device0_start
print *, "Device 1 execution time (seconds): ", device1_end - device1_start

end program

The output is as follows. Ideally, the total execution time should be close to the longer of the two GPUs' runtimes:

Total execution time (seconds):     2.847902
 Device 0 execution time (seconds):     1.736411
 Device 1 execution time (seconds):    0.9045539

Accelerator Kernel Timing data
    Timing may be affected by asynchronous behavior
    set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
/home/test_acc/multi_gpu_async_example.f90
  multi_gpu_async_example  NVIDIA  devicenum=0
    time(us): 191,837
    21: data region reached 2 times
        21: data copyin transfers: 144
             device time(us): total=99,625 max=699 min=56 avg=691
        30: data copyout transfers: 144
             device time(us): total=92,212 max=667 min=45 avg=640
    22: compute region reached 1 time
        22: kernel launched 1 time
            grid: [4687500]  block: [128]
            elapsed time(us): total=2,406 max=2,406 min=2,406 avg=2,406
/home/test_acc/multi_gpu_async_example.f90
  multi_gpu_async_example  NVIDIA  devicenum=1
    time(us): 191,417
    38: data region reached 2 times
        38: data copyin transfers: 144
             device time(us): total=99,088 max=702 min=62 avg=688
        47: data copyout transfers: 144
             device time(us): total=92,329 max=666 min=45 avg=641
    39: compute region reached 1 time
        39: kernel launched 1 time
            grid: [1]  block: [1]
            elapsed time(us): total=27 max=27 min=27 avg=27

There are a few reasons why this is happening. First, you have a “wait” between the compute regions, which causes the host thread to block until the first GPU finishes. Second, while data transfers can run concurrently with a compute region, multiple transfers get serialized with respect to each other. Finally, when the internal profiler (i.e. ACC_TIME) is used, it needs to disable async in order to get correct timings.
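In sketch form, the pattern you want is to queue the work on both devices with no wait in between, and only synchronize once everything has been launched. Here's a minimal, self-contained example of that pattern (not your code):

program two_gpu_sketch
implicit none
integer, parameter :: n = 1000
real :: a(n,n), b(n,n)
integer :: i, j

! Queue work on device 0; async returns control to the host immediately
!$acc set device_num(0)
!$acc parallel loop collapse(2) copyout(a) async(1)
do i = 1, n
   do j = 1, n
      a(i,j) = 2.0
   end do
end do

! Queue work on device 1 while device 0 is still busy
!$acc set device_num(1)
!$acc parallel loop collapse(2) copyout(b) async(2)
do i = 1, n
   do j = 1, n
      b(i,j) = 3.0
   end do
end do

! Wait on each device; a "wait" with no device clause only applies to the current device
!$acc wait
!$acc set device_num(0)
!$acc wait

print *, a(1,1), b(1,1)
end program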

To fix this, I rewrote your code to separate the data regions from the compute regions and to use “wait” only at the beginning and the end. I also put in waits around the timer calls; these are optional and only needed if you want accurate times.

Note that I also decided to launch each compute region 100 times to get better timings.

Here’s the modified code:

program multi_gpu_async_example
use openacc
implicit none
integer, parameter :: n = 30000
integer, parameter :: m = 20000
real, dimension(n, m) :: input_array, output_array_device0, output_array_device1
integer :: i, j, device_num
real :: start_time, end_time, total_time
real :: data_in_start, data_in_end
real :: data_out_start, data_out_end
real :: compute_start, compute_end
integer :: niter, ni

niter = 100
input_array = 1.0
output_array_device0 = 0.0
output_array_device1 = 0.0

call acc_init_device(0,acc_get_device_type())
call acc_init_device(1,acc_get_device_type())

call cpu_time(start_time)
call cpu_time(data_in_start)

! Create data on device 0
device_num = 0
!$acc set device_num(device_num)
!$acc enter data copyin(input_array) create(output_array_device0) async(1)

! Create data on device 1
device_num = 1
!$acc set device_num(device_num)
!$acc enter data copyin(input_array) create(output_array_device1) async(2)

! add wait so the timer can be used
!$acc wait
call cpu_time(data_in_end)
call cpu_time(compute_start)

! Start compute on device 0
device_num = 0
!$acc set device_num(device_num)
do ni=1,niter
!$acc parallel loop gang vector collapse(2) async(1)
do i = 1, n
   do j = 1, m
      output_array_device0(i, j) = input_array(i, j) * 2.0
   end do
end do
!$acc end parallel
end do

! Start compute on device 1
device_num = 1
!$acc set device_num(device_num)
do ni=1,niter
!$acc parallel loop gang vector collapse(2) private(i, j) async(2)
do i = 1, n
   do j = 1, m
      output_array_device1(i, j) = input_array(i, j) * 3.0  ! computation for device 1
   end do
end do
!$acc end parallel
end do

!$acc wait
call cpu_time(compute_end)
call cpu_time(data_out_start)
!$acc exit data delete(input_array) copyout(output_array_device1) async(2)

device_num = 0
!$acc set device_num(device_num)
!$acc exit data delete(input_array) copyout(output_array_device0) async(1)
!$acc wait
call cpu_time(data_out_end)
call cpu_time(end_time)
total_time = end_time - start_time

print *,  output_array_device0(1,1)
print *,  output_array_device1(1,1)

print *, "Total execution time (seconds): ", total_time
print *, "Data to GPU (seconds): ", data_in_end - data_in_start
print *, "Compute time (seconds): ", compute_end - compute_start
print *, "Data from GPU (seconds): ", data_out_end - data_out_start

end program

Then run the code twice, with and without the internal profiler, and compare the “compute” time from the first run with the profiler’s kernel time.

% nvfortran -acc -Ofast test.F90
% a.out
    2.000000
    3.000000
 Total execution time (seconds):     4.345081
 Data to GPU (seconds):    0.6381099
 Compute time (seconds):     2.683934
 Data from GPU (seconds):     1.023035
% setenv NV_ACC_TIME 1
% a.out
    2.000000
    3.000000
 Total execution time (seconds):     7.268367
 Data to GPU (seconds):    0.7806239
 Compute time (seconds):     5.441762
 Data from GPU (seconds):     1.045979

Accelerator Kernel Timing data
    Timing may be affected by asynchronous behavior
    set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
/local/home/mcolgrove/test.F90
  multi_gpu_async_example  NVIDIA  devicenum=1
    time(us): 2,910,210
    33: data region reached 1 time
        33: data copyin transfers: 144
             device time(us): total=93,329 max=1,324 min=94 avg=648
    57: compute region reached 100 times
        57: kernel launched 100 times
            grid: [4687500]  block: [128]
             device time(us): total=2,722,403 max=31,972 min=26,659 avg=27,224
            elapsed time(us): total=2,727,322 max=32,006 min=26,681 avg=27,273
    57: data region reached 200 times
    69: data region reached 1 time
        69: data copyout transfers: 144
             device time(us): total=94,478 max=1,352 min=107 avg=656
/local/home/mcolgrove/test.F90
  multi_gpu_async_example  NVIDIA  devicenum=0
    time(us): 2,894,442
    28: data region reached 1 time
        28: data copyin transfers: 144
             device time(us): total=93,598 max=1,323 min=92 avg=649
    44: compute region reached 100 times
        44: kernel launched 100 times
            grid: [4687500]  block: [128]
             device time(us): total=2,699,688 max=31,949 min=26,595 avg=26,996
            elapsed time(us): total=2,713,714 max=32,007 min=26,617 avg=27,137
    44: data region reached 200 times
    73: data region reached 1 time
        73: data copyout transfers: 144
             device time(us): total=101,156 max=711 min=58 avg=702

Thank you! I think I found the reason. It’s because I set NV_ACC_TIME=1 in the default environment variables, which caused async to be disabled.

Hello, I have a small question regarding data transfer. Part of the code is as follows: the data input and execution work fine, but after the data is transferred out, all the values are 0.

do device_num = 0, GPU_COUNT - 1

    ! enter data for each GPU
    !$ACC set device_num(device_num)
    !$acc enter data copyin(data1_dv(device_num+1,:,:), data2_dv(device_num+1,:,:)) &
    !$acc& create(output1_dv(device_num+1,:,:), output2_dv(device_num+1,:,:)) async(device_num+1)

end do
!$acc wait

do device_num = 0, GPU_COUNT - 1

    call calculate(data1_dv(device_num+1,:,:), data2_dv(device_num+1,:,:),output1_dv(device_num+1,:,:), output2_dv(device_num+1,:,:),device_num,GPU_COUNT)

end do

!$acc wait
do device_num = 0, GPU_COUNT - 1

    !$ACC set device_num(device_num)
    !$acc exit data delete(data1_dv(device_num+1,:,:), data2_dv(device_num+1,:,:))&
    !$acc& copyoutoutput1_dv(device_num+1,:,:), output2_dv(device_num+1,:,:)) async(device_num+1)

end do

subroutine calculate(data1,data2,output1,output2,device_num,GPU_COUNT)
    real, dimension(:,:), intent(in) :: data1,data2
    integer, value,intent(in)            ::  device_num,GPU_COUNT
    real, dimension(:,:),intent(out) :: output1,output2
    integer :: i,istart,iend
    !$ACC set device_num(device_num)

    !$acc parallel loop gang vector collapse(2) private(...) async(device_num+1)
    do i = istart, iend
         ----calculate---
    end do
    print(output)  !  Here is right
    !$acc end parallel
end subroutine

Can you post a complete reproducing example?

Since there’s missing code, I’m not sure if you mean that you’re printing after the exit data do loop and, if so, whether you forgot to include the wait directive, so the data is still transferring while the host thread prints.

Or you may mean the print statement in “calculate”, though you do have a comment saying “this is right”, so that part might be working correctly.
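For reference, here's a minimal sketch (with hypothetical names) of the waiting pattern I mean. Without the waits after the exit data loop, the host can read the arrays while the asynchronous copyout is still in flight:

do device_num = 0, GPU_COUNT - 1
    !$acc set device_num(device_num)
    !$acc exit data copyout(output1_dv(device_num+1,:,:)) async(device_num+1)
end do
! Wait on each device so every queued copyout has finished before the host reads the data
do device_num = 0, GPU_COUNT - 1
    !$acc set device_num(device_num)
    !$acc wait
end do
print *, output1_dv(1,1,1)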

Now I do see what appears to be a typo:

!$acc& copyoutoutput1_dv(device_num+1,:,:), output2_dv(device_num+1,:,:)) async(device_num+1)

Should be:

!$acc& copyout(output1_dv(device_num+1,:,:), output2_dv(device_num+1,:,:)) async(device_num+1)

If this is in your code and not just a copy-and-paste error, then the directive may not be getting compiled, although I’d actually expect a syntax error here.

Here is a relatively complete version of the code. It prints correctly during the calculation, but after the calculation is complete and the output data has been transferred back, all the printed values are zeros.

allocate(array_sp1_dv(GPU_COUNT,chunk_size,xn),array_dir1_dv(GPU_COUNT,chunk_size,xn),array_sp2_dv(GPU_COUNT,chunk_size,xn),array_dir2_dv(GPU_COUNT,chunk_size,xn),CC1_dv(GPU_COUNT,4,chunk_size,xn),CC2_dv(GPU_COUNT,4,chunk_size,xn)) !output
call divide_data(data1, data2, data3, lon,lat, GPU_COUNT, data1_dv, data2_dv, data3_dv, lon_dv, lat_dv)

do device_num = 0, GPU_COUNT - 1

    ! enter data for each GPU
    !$ACC set device_num(device_num)
    !$acc enter data copyin(data1_dv(device_num+1,:,:), data2_dv(device_num+1,:,:), data3_dv(device_num+1,:,:), lon_dv(device_num+1,:,:), lat_dv(device_num+1,:,:), qf) &
    !$acc& create(array_sp1_dv(device_num+1,:,:), array_dir1_dv(device_num+1,:,:),array_sp2_dv(device_num+1,:,:), array_dir2_dv(device_num+1,:,:), CC1_dv(device_num+1,:,:,:), CC2_dv(device_num+1,:,:,:)) async(device_num+1)

end do

!$acc wait
do device_num = 0, GPU_COUNT - 1

    call calculate(data1_dv(device_num+1,:,:), data2_dv(device_num+1,:,:), data3_dv(device_num+1,:,:), x_dv(device_num+1,:,:), y_dv(device_num+1,:,:), qf,&
     device_num, GPU_COUNT,xn, sbox_width, bbox_width, delta_t12, overlap_flag,array_sp1_dv(device_num+1,:,:), array_dir1_dv(device_num+1,:,:), &
    array_sp2_dv(device_num+1,:,:), array_dir2_dv(device_num+1,:,:), CC1_dv(device_num+1,:,:,:), CC2_dv(device_num+1,:,:,:))

end do

!$acc wait
do device_num = 0, GPU_COUNT - 1

    !$ACC set device_num(device_num)
    !$acc exit data delete(data1_dv(device_num+1,:,:), data2_dv(device_num+1,:,:), data3_dv(device_num+1,:,:), x_dv(device_num+1,:,:), y_dv(device_num+1,:,:), qf)&
    !$acc& copyout(array_sp1_dv(device_num+1,:,:), array_dir1_dv(device_num+1,:,:),array_sp2_dv(device_num+1,:,:), array_dir2_dv(device_num+1,:,:), CC1_dv(device_num+1,:,:,:), CC2_dv(device_num+1,:,:,:)) async(device_num+1)

end do

!$acc wait
!here print array_sp1_dv is zero
print*,'xx',size(array_sp1_dv,1),size(array_sp1_dv,2),size(array_sp1_dv,3),array_sp1_dv(:,1000,1200)
do nn = 1,size(array_sp1_dv,2)
    do mm = 1,size(array_sp1_dv,3)
        if(array_sp1_dv(2,nn,mm) > 0 .and. array_sp1_dv(2,nn,mm) <= 100) then
            print*,'sp test xx',nn,mm,array_sp1_dv(2,nn,mm),size(array_sp1_dv,1),size(array_sp1_dv,2),size(array_sp1_dv,3)
        end if
    end do
end do

do device_num = 0, GPU_COUNT - 1
    start_row = device_num * chunk_size + 1
    end_row = (device_num + 1) * chunk_size  ! Ensure we don't exceed nLine
    nrows = end_row - start_row + 1
    array_sp1(start_row:end_row, :) = array_sp1_dv(device_num+1,:,:)
    array_dir1(start_row:end_row, :) = array_dir1_dv(device_num+1,:,:)
    array_sp2(start_row:end_row, :) = array_sp2_dv(device_num+1,:,:)
    array_dir2(start_row:end_row, :) = array_dir2_dv(device_num+1,:,:)

    CC1(:,start_row:end_row, :) = CC1_dv(device_num+1,:,:,:)
    CC2(:,start_row:end_row, :) = CC2_dv(device_num+1,:,:,:)
end do


subroutine calculate(data1, data2, data3, lon, lat,qf, device_num,GPU_COUNT,line, sbox_size, bbox_size, delta_t,overlap_flag, sp11, dir11, sp22, dir22, CC11, CC22)
    use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
    implicit none

    real, dimension(:,:), intent(in) :: data1, data2, data3
    real, dimension(:,:), intent(in) :: lon, lat
    integer(kind=1),dimension(:,:), intent(in) ::qf
    integer, value,intent(in)            ::  device_num,GPU_COUNT,line, sbox_size, bbox_size
    integer(kind=1), intent(in)     :: overlap_flag
    real, intent(in)               :: delta_t

    ! -----------------
    REAL, dimension(:,:), INTENT(OUT) :: sp11, dir11, sp22, dir22
    REAL, dimension(:,:,:), INTENT(OUT) :: CC11, CC22

    integer :: i, j, ii, jj, iii,istart1, jstart1, iend1, jend1, istart2, jstart2, tx1, ty1, tx2, ty2,txx1, tyy1, txx2, tyy2, tx, ty, nx, ny, boundary
    integer :: x, y, xx, yy, tag1, tag2, pp, oo, status, chunk_size, istart, iend
    real :: correlation1, correlation2, max_cc1, max_cc2
    real :: percent,lon1, lat1, lon2, lat2, lon3, lat3
    integer :: start_time, end_time, clock_rate, displace
    real :: elapsed_time
    real :: nan_value
    REAL :: E_t, E_s, E_ts, sita_t, sita_s
    nan_value = ieee_value(1.0, ieee_quiet_nan)
    tag1 = 1
    tag2 = 2
    nx = bbox_size - sbox_size + 1
    ny = bbox_size - sbox_size + 1
    boundary = size(data1,1)
    chunk_size = line/GPU_COUNT

    istart = device_num*chunk_size+1
    iend = (device_num + 1) * chunk_size
    displace = (bbox_size - sbox_size)/2
    ALLOCATE(sub1(sbox_size, sbox_size), sub2(sbox_size, sbox_size), sub3(sbox_size, sbox_size), STAT=status)

    !$ACC set device_num(device_num)
    !$acc parallel loop gang vector collapse(2) private(sub1, sub2, sub3, max_cc1, max_cc2, istart2, jstart2, tx, ty, tx1, ty1, tx2, ty2,txx1,tyy1,txx2,tyy2) firstprivate(sbox_size, bbox_size) async(device_num+1)
    do i = istart, iend
        do j = 1, line

            if(qf(i,j) /= 0) then
                cycle
            end if
            if(device_num == 0) then
                 istart2 = (i-1) * sbox_size/(1+overlap_flag) + 1 - device_num*chunk_size
            else
                 istart2 = (i-1) * sbox_size/(1+overlap_flag) + 1 - device_num*chunk_size + displace
            end if
            jstart2 = (j-1) * sbox_size/(1+overlap_flag) + 1

            DO ii = 1, sbox_size
                DO jj = 1, sbox_size
                    sub2(ii, jj) =data2(istart2 + ii - 1, jstart2 + jj - 1)
                END DO
            END DO

            iii = i - device_num*chunk_size

            max_cc1 = -1.0
            max_cc2 = -1.0

            do x = 1, (bbox_size - sbox_size + 1)
                do y = 1, (bbox_size - sbox_size + 1)
                    istart1 = max(1, istart2 - (bbox_size-sbox_size)/2 + x - 1)  !set left right boundary but not neccessary
                    jstart1 = max(1, jstart2 - (bbox_size-sbox_size)/2 + y - 1)

                    iend1 = min(istart1+sbox_size-1,boundary)
                    jend1 = min(jstart1+sbox_size-1,boundary)

                    DO xx = 1, iend1 - istart1 + 1
                        DO yy = 1, jend1 - jstart1 + 1
                            sub1(xx, yy) = data1(istart1 + xx - 1, jstart1 + yy - 1)
                        END DO
                    END DO
                    DO xx = 1, iend1 - istart1 + 1
                        DO yy = 1, jend1 - jstart1 + 1
                            sub3(xx, yy) = data3(istart1 + xx - 1, jstart1 + yy - 1)
                        END DO
                    END DO
                    E_t = SUM(sub2) / (sbox_size * sbox_size)
                    E_s = SUM(sub1) / (sbox_size * sbox_size)
                    E_ts = SUM((sub2 - E_t) * (sub1 - E_s)) / (sbox_size * sbox_size)
                    sita_t = SQRT(SUM((sub2 - E_t)**2) / (sbox_size * sbox_size))
                    sita_s = SQRT(SUM((sub1 - E_s)**2) / (sbox_size * sbox_size))
                    IF (sita_t == 0.0 .OR. sita_s == 0.0) THEN
                        correlation1 = 0.0
                    ELSE
                        correlation1 = E_ts / (sita_t * sita_s)
                    END IF
                    if(correlation1 .gt. 1) then
                        correlation1 = 0
                    end if

                    E_t = SUM(sub2) / (sbox_size * sbox_size)
                    E_s = SUM(sub3) / (sbox_size * sbox_size)
                    E_ts = SUM((sub2 - E_t) * (sub3 - E_s)) / (sbox_size * sbox_size)
                    sita_t = SQRT(SUM((sub2 - E_t)**2) / (sbox_size * sbox_size))
                    sita_s = SQRT(SUM((sub3 - E_s)**2) / (sbox_size * sbox_size))
                    IF (sita_t == 0.0 .OR. sita_s == 0.0) THEN
                        correlation2 = 0.0
                    ELSE
                        correlation2 = E_ts / (sita_t * sita_s)
                    END IF
                    if(correlation2 .gt. 1) then
                        correlation2 = 0
                    end if

                    IF (max_cc1 < correlation1) THEN
                        max_cc1 = correlation1
                        txx1 = x
                        tyy1 = y
                    END IF

                    IF (max_cc2 < correlation2) THEN
                        max_cc2 = correlation2
                        txx2 = x
                        tyy2 = y
                    end if

                end do
            end do
            print*,'sp and dir', i,j,txx1,tyy1,sp11(iii,j), sp22(iii,j), dir11(iii,j), dir22(iii,j)  !ok
        end do
    end do
    !$acc end parallel

    DEALLOCATE(sub1, sub2, sub3)

end subroutine calculate

Sorry, but I’ll still need to ask for a full reproducing example that I can build and run here. It’s very rare for folks to program multi-GPUs this way (I always recommend using MPI+OpenACC), so I’ll need to experiment with your code a bit to understand where the problem is.

Thank you very much for your suggestions! I may use MPI combined with OpenACC for future improvements. Since the code and data are quite complex, reproducing the issue might be a bit tricky. I have uploaded the code and data here: acc_test.zip, so if you have time, you can try debugging it.

 h5fc -acc -Minfo=accel -o cc calculate_correlation_allocate.f90
 ./cc xx.nml

As for the current issue, I would like to show you the output. When I print within the parallel region, the calculated values are correct, but after transferring the array from the device, the printed values are all zeros.

!$acc wait
call cpu_time(compute_end)
call cpu_time(date_out_start)
do device_num = 0, GPU_COUNT - 1
  !$ACC set device_num(device_num)
  !$acc exit data delete(data1_dv(device_num+1,:,:), data2_dv(device_num+1,:,:), data3_dv(device_num+1,:,:), lon_dv(device_num+1,:,:), lat_dv(device_num+1,:,:))&
  !$acc& copyout(array_sp1_dv(device_num+1,:,:), array_dir1_dv(device_num+1,:,:),array_sp2_dv(device_num+1,:,:), array_dir2_dv(device_num+1,:,:), CC1_dv(device_num+1,:,:,:), CC2_dv(device_num+1,:,:,:)) async(device_num+1)
end do
!$acc wait
print*,' 98 125',array_sp1_dv(1,98,125)

Output: 98 125 1.3987837E-05. And when I print inside the calculate function:

print*,'xx',i,j,sp11(iii,j),sp22(iii,j),iii

Output: xx 98 125 6.390867, which is the correct result. So I think the issue might be that the data is not being transferred back from the device correctly when it is passed through the subroutine.

Thanks!

The problem is with passing a non-contiguous slice of a 3D array to a 2D dummy argument. In this case, the compiler must create a temporary array to hold the 2D section, and it’s that temporary that’s actually being used on the device. Worse, the temporaries get implicitly copied to/from the device.

To fix this, you’ll need to pass the full 3D arrays into the subroutine and then index the first dimension by the device number.

Note that the full array will get created on each device in order to maintain correct indexing, so you’ll be wasting some memory by using an index dimension to hold the device number.
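Something along these lines, as a minimal sketch (hypothetical names and a placeholder computation, not your actual routine; it also assumes the enter data directives were changed to put the full arrays, rather than slices, on each device, and that the routine has an explicit interface, e.g. by placing it in a module):

subroutine calculate_full(sp1_dv, device_num, chunk_size, xn)
    real, dimension(:,:,:), intent(inout) :: sp1_dv
    integer, value, intent(in) :: device_num, chunk_size, xn
    integer :: d, i, j
    d = device_num + 1            ! index the device dimension inside the routine
    !$acc set device_num(device_num)
    !$acc parallel loop gang vector collapse(2) present(sp1_dv) async(device_num+1)
    do i = 1, chunk_size
        do j = 1, xn
            sp1_dv(d,i,j) = 1.0   ! placeholder for the real computation
        end do
    end do
end subroutine calculate_full

! The caller passes the whole array, with no slicing, so no hidden temporary is created:
!     call calculate_full(array_sp1_dv, device_num, chunk_size, xn)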

Hope this helps,
Mat
