Hi Mat and All,
I found a very surprising speed difference (about 20 times) between !$acc kernels loop and !$acc parallel loop in the following simple loop tests:
1. !$acc kernels loop:
CODE:
call system_clock(count1, count_rate, count_max)
!$acc kernels loop
do i = 1, n_size
  do j = 1, n_size
    do k = 1, n_size
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
    enddo
  enddo
enddo
print *, 'iteration#:', n_size*n_size
call system_clock(count2, count_rate, count_max)
write(*,*) 'GPU costs', (count2-count1), 'microseconds'
RESULTS:
iteration#: 4000000
GPU costs 1030000 microseconds
2. !$acc parallel loop:
CODE:
call system_clock(count1, count_rate, count_max)
!$acc parallel loop
do i = 1, n_size
  do j = 1, n_size
    do k = 1, n_size
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
    enddo
  enddo
enddo
!$acc end parallel
print *, 'iteration#:', n_size*n_size
call system_clock(count2, count_rate, count_max)
write(*,*) 'GPU costs', (count2-count1), 'microseconds'
RESULTS:
iteration#: 4000000
GPU costs 22168000 microseconds
Why are they so different? Any input on the reasons behind this would be much appreciated.
Thanks,
Jingsen