Have CPU & GPU computing concurrently?

Hi Mat,
For the sample I posted a few days ago (below), the CPU computation of the loop starts only after the GPU computation finishes. I'm wondering how to make them both run their loop computations concurrently. What kind of !$acc directives should I use for this?

Thanks,
Jingsen

! matrix-acc.f
program example1


parameter ( n_size=2000 )
real*8, dimension(:,:) :: a(n_size,n_size)
real*8, dimension(:,:) :: b(n_size,n_size)
real*8, dimension(:,:) :: c(n_size,n_size)
real*8, dimension(:,:) :: d(n_size,n_size)
character(10) :: time
real tmp
integer count1, count2, count_rate, count_max


! Initialize matrices (values differ from C version)
do i=1, n_size
do j=1, n_size
a(i,j) = i + j;
b(i,j) = i - j;
enddo
enddo
c=0.d0
d=0.d0

tmp=0.d0
call system_clock(count1, count_rate, count_max)
!$acc kernels loop !reduction(+:tmp)
do i=1, n_size
do j=1, n_size
do k = 1, n_size
c(i,j) = c(i,j) + a(i,k)*b(k,j)
tmp=tmp+1.d0
enddo
enddo
enddo

print*, 'iteration#:',tmp

call system_clock(count2, count_rate, count_max)
write(*,*)'GPU costs',(count2-count1),'microseconds'

tmp=0.d0
call system_clock(count1, count_rate, count_max)
do i=1, n_size
do j=1, n_size
do k = 1, n_size
d(i,j) = d(i,j) + a(i,k)*b(k,j)
tmp=tmp+1.d0
enddo
enddo
enddo
call system_clock(count2, count_rate, count_max)
write(*,*)'CPU costs',(count2-count1),'microseconds'

! check the results
do i=1, n_size
do j=1, n_size
if( c(i,j) .ne. d(i,j) )then
print *, i,j, c(i,j), d(i,j)
stop 'error found'
endif
enddo
enddo
print *, n_size*n_size, 'iterations completed'


end program example1

Hi Jingsen,

You would add the "async" clause to the kernels directive. The caveat is that no data can be copied back at the end of the compute region, else the code will block waiting for that data. The solution is to wrap a data region around both the compute region and the host loop, so the copy-out is deferred until the data region ends. You'd also need to remove the sum reduction, since its scalar result would have to be copied back at the end of the compute region.

  • Mat
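The pattern boils down to the following skeleton (directive placement only, loop bodies elided; the `copyout(c)` at `end data` is what finally synchronizes with the async kernel):

```
!$acc data copyin(a,b) copyout(c)   ! keep arrays on the device across the region
!$acc kernels async                 ! launch GPU work and return immediately
   ... GPU matrix-multiply loops filling c ...
!$acc end kernels
   ... CPU loops filling d run here, overlapped with the GPU ...
!$acc end data                      ! copy-out of c waits for the async kernel
```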
% cat matrix.f90
! matrix-acc.f
program example1


parameter ( n_size=2000 )
real*8, dimension(:,:) :: a(n_size,n_size)
real*8, dimension(:,:) :: b(n_size,n_size)
real*8, dimension(:,:) :: c(n_size,n_size)
real*8, dimension(:,:) :: d(n_size,n_size)
character(10) :: time
real tmp, tmp_cpu
integer count1, count2, count_rate, count_max

! Initialize matrices (values differ from C version)
do i=1, n_size
do j=1, n_size
a(i,j) = i + j;
b(i,j) = i - j;
enddo
enddo
d=0.d0
!$acc data copyin(a,b) copyout(c)

call system_clock(count1, count_rate, count_max)
!$acc kernels async
c=0.d0
do i=1, n_size
do j=1, n_size
do k = 1, n_size
c(i,j) = c(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo
!$acc end kernels

call system_clock(count2, count_rate, count_max)
write(*,*)'GPU costs',(count2-count1),'microseconds'

call system_clock(count1, count_rate, count_max)
do i=1, n_size
do j=1, n_size
do k = 1, n_size
d(i,j) = d(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo

!$acc end data

call system_clock(count2, count_rate, count_max)
write(*,*)'CPU costs',(count2-count1),'microseconds'

! check the results
do i=1, n_size
do j=1, n_size
if( c(i,j) .ne. d(i,j) )then
print *, i,j, c(i,j), d(i,j)
stop 'error found'
endif
enddo
enddo
print *, n_size*n_size, 'iterations completed'


end program example1
% pgf90 -fast matrix.f90 -acc -ta=nvidia,4.2 -Minfo
example1:
     15, Loop interchange produces reordered loop nest: 16,15
         Generated vector sse code for the loop
     21, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
     22, Generating copyout(c(:,:))
         Generating copyin(b(:,:))
         Generating copyin(a(:,:))
     25, Generating present_or_copyin(b(:,:))
         Generating present_or_copyin(a(:,:))
         Generating present_or_copyout(c(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     26, Loop is parallelizable
         Accelerator kernel generated
         26, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(64) ! blockidx%x threadidx%x
             CC 1.3 : 13 registers; 28 shared, 24 constant, 0 local memory bytes
             CC 2.0 : 13 registers; 0 shared, 44 constant, 0 local memory bytes
     27, Loop is parallelizable
     28, Loop is parallelizable
     29, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         27, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         28, !$acc loop gang ! blockidx%y
         29, CC 1.3 : 15 registers; 44 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 20 registers; 0 shared, 60 constant, 0 local memory bytes
     40, Loop interchange produces reordered loop nest: 41,42,40
         Generated an alternate version of the loop
         Generated vector sse code for the loop
         Generated 2 prefetch instructions for the loop
     55, Loop not vectorized/parallelized: contains call
% a.out
 GPU costs        52583 microseconds
 CPU costs      5972312 microseconds
      4000000 iterations completed
% setenv PGI_ACC_SYNCHRONOUS 1  << disable async
% a.out
 GPU costs       466722 microseconds
 CPU costs      5934507 microseconds
      4000000 iterations completed