Hi Mat and All,

I found a very surprising speed difference (about 20 times) between $acc kernels loop and $acc parallel loop in the following simple loop tests:

1. $acc kernels loop:

CODE:

call system_clock(count1, count_rate, count_max)
!$acc kernels loop
do i = 1, n_size
  do j = 1, n_size
    do k = 1, n_size
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
    enddo
  enddo
enddo
print *, 'iteration#:', n_size*n_size
call system_clock(count2, count_rate, count_max)
write(*,*) 'GPU costs', (count2-count1), 'microseconds'

RESULTS:

iteration#: 4000000

GPU costs 1030000 microseconds

2. $acc parallel loop:

CODE:

call system_clock(count1, count_rate, count_max)
!$acc parallel loop
do i = 1, n_size
  do j = 1, n_size
    do k = 1, n_size
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
    enddo
  enddo
enddo
!$acc end parallel loop
print *, 'iteration#:', n_size*n_size
call system_clock(count2, count_rate, count_max)
write(*,*) 'GPU costs', (count2-count1), 'microseconds'

RESULTS:

iteration#: 4000000

GPU costs 22168000 microseconds
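
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the test harness. Only the loop nest and the system_clock calls come from my snippets; the program name, the array declarations, the initial values, and the conversion to seconds via count_rate are assumptions added for illustration. n_size = 2000 is inferred from the printed iteration count (4000000 = 2000*2000). Swapping the directive to !$acc parallel loop should give the second variant.

```fortran
program time_matmul
  implicit none
  ! Assumed kinds and size; none of these declarations were in the original snippets.
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n_size = 2000      ! inferred from iteration# = 4000000
  integer(i8) :: count1, count2, count_rate, count_max
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j, k

  allocate(a(n_size,n_size), b(n_size,n_size), c(n_size,n_size))
  a = 1.0_dp                               ! assumed initial values
  b = 2.0_dp
  c = 0.0_dp

  call system_clock(count1, count_rate, count_max)
  ! Without an OpenACC-enabled compiler this directive is treated as a comment
  ! and the loop nest runs serially on the host.
  !$acc kernels loop
  do i = 1, n_size
    do j = 1, n_size
      do k = 1, n_size
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      enddo
    enddo
  enddo
  call system_clock(count2, count_rate, count_max)

  ! Dividing by count_rate gives seconds regardless of the clock's tick unit.
  print *, 'elapsed seconds:', real(count2 - count1, dp) / real(count_rate, dp)
  ! Print one element so the result is actually used.
  print *, 'c(1,1) =', c(1,1)
end program time_matmul
```

The timed region here includes the implicit host/device data transfers as well as the kernel itself, just as in my tests above.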

Why are they so different? Any input on the reasons behind this would be much appreciated.

Thanks,

Jingsen