(!$acc parallel loop) vs. (!$acc kernels loop)?

Hi Mat and All,
I found a surprisingly large speed difference (about 20x) between these two directives in the following simple loop tests:

1. !$acc kernels loop:
CODE:
call system_clock(count1, count_rate, count_max)
!$acc kernels loop
do i=1, n_size
do j=1, n_size
do k = 1, n_size
c(i,j) = c(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo

print*, 'iteration#:', n_size*n_size

call system_clock(count2, count_rate, count_max)
write(*,*) 'GPU costs', (count2-count1), 'microseconds'

RESULTS:
iteration#: 4000000
GPU costs 1030000 microseconds


2. !$acc parallel loop:

CODE:
call system_clock(count1, count_rate, count_max)
!$acc parallel loop
do i=1, n_size
do j=1, n_size
do k = 1, n_size
c(i,j) = c(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo
!$acc end parallel loop
print*, 'iteration#:', n_size*n_size

call system_clock(count2, count_rate, count_max)
write(*,*) 'GPU costs', (count2-count1), 'microseconds'

RESULTS:

iteration#: 4000000
GPU costs 22168000 microseconds

Why are they so different? Any input on the reasons behind this would be much appreciated.

Thanks,
Jingsen

Hi Jingsen,

The main difference between the “kernels” and “parallel” constructs is that with “kernels” the compiler does all of the scheduling and kernel generation automatically by default, while with “parallel” it is up to the user to decide how the kernels are created and how the loops are scheduled.
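
For example, with “kernels” the compiler is free to parallelize both the i and j loops of your matrix multiply on its own, whereas a bare “parallel loop” only asserts that the outermost loop is parallel, so the inner loops may end up running sequentially per gang. A minimal sketch of giving “parallel” an explicit schedule (assuming the same a, b, c, and n_size as in your test) might look like:

! Collapse the i and j loops so their combined iteration space is spread
! across the gangs/vectors, and keep k sequential since every k iteration
! updates the same c(i,j).
!$acc parallel loop collapse(2)
do i = 1, n_size
   do j = 1, n_size
      !$acc loop seq
      do k = 1, n_size
         c(i,j) = c(i,j) + a(i,k)*b(k,j)
      enddo
   enddo
enddo

With a schedule like this, the “parallel” version should behave much more like the “kernels” version, but check the compiler feedback to confirm what actually gets generated.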

This article goes into more depth: Account Login | PGI

Take a look at the compiler feedback messages (-Minfo=accel) and pay particular attention to how the loops are being scheduled. This should explain the performance difference. Note that the schedule will also affect the use of caching, which may be another factor in the performance difference.
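
For example, compiling with something like “pgfortran -acc -Minfo=accel mm.f90” (the file name here is just a placeholder) will print a message for each loop describing how it was scheduled (gang, vector, seq, and so on), which you can compare between the two versions.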

  • Mat