Hi Jingsen,
You would add the “async” clause to the kernels directive. The caveat is that no data can be copied back at the end of the compute region, otherwise the code will block waiting for the data. The solution is to wrap a data region around both the compute region and the host loop, so the copy-out is deferred until the data region ends. You’d also need to remove the sum reduction, since its scalar result has to be copied back at the end of the compute region, which would force a wait right there.
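In outline (just a sketch; the full, working example follows below), the pattern looks like this:

!$acc data copyin(a,b) copyout(c)
!$acc kernels async        ! launch the GPU work; the host continues
   ! ... GPU compute ...
!$acc end kernels
   ! ... host work, overlapped with the GPU kernel ...
!$acc end data             ! blocks until the kernel finishes, then copies c back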
% cat matrix.f90
! matrix-acc.f
program example1
   parameter ( n_size=2000 )
   real*8, dimension(n_size,n_size) :: a
   real*8, dimension(n_size,n_size) :: b
   real*8, dimension(n_size,n_size) :: c
   real*8, dimension(n_size,n_size) :: d
   character(10) :: time
   real tmp, tmp_cpu
   integer count1, count2, count_rate, count_max

! Initialize matrices (values differ from C version)
   do i=1, n_size
      do j=1, n_size
         a(i,j) = i + j
         b(i,j) = i - j
      enddo
   enddo
   d = 0.d0

! The data region keeps a, b, and c on the GPU across both compute sections.
!$acc data copyin(a,b) copyout(c)
   call system_clock(count1, count_rate, count_max)
! Launch the GPU matmul asynchronously; the host continues past "end kernels".
!$acc kernels async
   c = 0.d0
   do i=1, n_size
      do j=1, n_size
         do k=1, n_size
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
         enddo
      enddo
   enddo
!$acc end kernels
   call system_clock(count2, count_rate, count_max)
   write(*,*) 'GPU costs', (count2-count1), 'microseconds'

! Host matmul, overlapped with the GPU kernel.
   call system_clock(count1, count_rate, count_max)
   do i=1, n_size
      do j=1, n_size
         do k=1, n_size
            d(i,j) = d(i,j) + a(i,k)*b(k,j)
         enddo
      enddo
   enddo
! "end data" blocks until the async kernel finishes, then copies c back.
!$acc end data
   call system_clock(count2, count_rate, count_max)
   write(*,*) 'CPU costs', (count2-count1), 'microseconds'

! Check the results
   do i=1, n_size
      do j=1, n_size
         if( c(i,j) .ne. d(i,j) )then
            print *, i, j, c(i,j), d(i,j)
            stop 'error found'
         endif
      enddo
   enddo
   print *, n_size*n_size, 'iterations completed'
end program example1
% pgf90 -fast matrix.f90 -acc -ta=nvidia,4.2 -Minfo
example1:
15, Loop interchange produces reordered loop nest: 16,15
Generated vector sse code for the loop
21, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
22, Generating copyout(c(:,:))
Generating copyin(b(:,:))
Generating copyin(a(:,:))
25, Generating present_or_copyin(b(:,:))
Generating present_or_copyin(a(:,:))
Generating present_or_copyout(c(:,:))
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
26, Loop is parallelizable
Accelerator kernel generated
26, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
!$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 1.3 : 13 registers; 28 shared, 24 constant, 0 local memory bytes
CC 2.0 : 13 registers; 0 shared, 44 constant, 0 local memory bytes
27, Loop is parallelizable
28, Loop is parallelizable
29, Complex loop carried dependence of 'c' prevents parallelization
Loop carried dependence of 'c' prevents parallelization
Loop carried backward dependence of 'c' prevents vectorization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
27, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
28, !$acc loop gang ! blockidx%y
29, CC 1.3 : 15 registers; 44 shared, 4 constant, 0 local memory bytes
CC 2.0 : 20 registers; 0 shared, 60 constant, 0 local memory bytes
40, Loop interchange produces reordered loop nest: 41,42,40
Generated an alternate version of the loop
Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
55, Loop not vectorized/parallelized: contains call
% a.out
GPU costs 52583 microseconds
CPU costs 5972312 microseconds
4000000 iterations completed
% setenv PGI_ACC_SYNCHRONOUS 1   <- disable async
% a.out
GPU costs 466722 microseconds
CPU costs 5934507 microseconds
4000000 iterations completed
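One note on reading the timings: with “async”, the first timer only measures the kernel launch, since the host returns immediately; the actual wait for the GPU happens later, at the “end data” boundary, inside the section timed as CPU work. With PGI_ACC_SYNCHRONOUS=1 the kernels region blocks, so the first timer then shows the full kernel execution time (466722 vs 52583 above).

If you’d rather synchronize explicitly, OpenACC also provides a “wait” directive. A minimal sketch (untested here) that times the total overlapped work:

   call system_clock(count1, count_rate, count_max)
!$acc kernels async
   ! ... GPU work ...
!$acc end kernels
   ! ... overlapped host work ...
!$acc wait   ! block until all outstanding asynchronous work completes
   call system_clock(count2, count_rate, count_max)
   write(*,*) 'Total (overlapped) time:', (count2-count1), 'microseconds'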