Hello, I’ve ported a routine to run on a Tesla card using the PGI accelerator directives. The code uses double precision and I’m seeing a speed-up of approximately a factor of three over a single Nehalem core. Basically I’m wondering whether I can do better. The most expensive bits of the routine look like:
DO jk = 1, jpkm1
   DO jj = 1, jpjm1
      DO ji = 1, jpim1
         zabe1 = ( fsahtu(ji,jj,jk) + pahtb0 ) * e2u(ji,jj) * fse3u(ji,jj,jk) / e1u(ji,jj)
         zabe2 = ( fsahtv(ji,jj,jk) + pahtb0 ) * e1v(ji,jj) * fse3v(ji,jj,jk) / e2v(ji,jj)
         !
         zmsku = 1.0_wp / MAX(   tmask(ji+1,jj,jk  ) + tmask(ji,jj,jk+1)              &
            &                  + tmask(ji+1,jj,jk+1) + tmask(ji,jj,jk  ), 1.0_wp )
         !
         zmskv = 1.0_wp / MAX(   tmask(ji,jj+1,jk  ) + tmask(ji,jj,jk+1)              &
            &                  + tmask(ji,jj+1,jk+1) + tmask(ji,jj,jk  ), 1.0_wp )
         !
         zcof1 = - fsahtu(ji,jj,jk) * e2u(ji,jj) * uslp(ji,jj,jk) * zmsku
         zcof2 = - fsahtv(ji,jj,jk) * e1v(ji,jj) * vslp(ji,jj,jk) * zmskv
         !
         zftu(ji,jj,jk) = ( zabe1 * zdit(ji,jj,jk,jn)                                 &
            &             + zcof1 * (   zdkt (ji+1,jj,jk) + zdk1t(ji,jj,jk)           &
            &                         + zdk1t(ji+1,jj,jk) + zdkt (ji,jj,jk) ) ) * umask(ji,jj,jk)
         zftv(ji,jj,jk) = ( zabe2 * zdjt(ji,jj,jk,jn)                                 &
            &             + zcof2 * (   zdkt (ji,jj+1,jk) + zdk1t(ji,jj,jk)           &
            &                         + zdk1t(ji,jj+1,jk) + zdkt (ji,jj,jk) ) ) * vmask(ji,jj,jk)
      END DO
   END DO
END DO
where jpkm1=30, jpjm1=148 and jpim1=181. The loop nest sits inside a data region (a rough sketch of that, together with the compile line, follows the compiler output below), and I'm using the 'time' sub-option of -ta=nvidia to see how long each kernel is taking. The compiler output for this nested loop (which begins at line 246 in my source) is:
246, Loop is parallelizable
     Accelerator kernel generated
    240, !$acc do parallel, vector(4) ! blockidx%y threadidx%z
    244, !$acc do parallel, vector(8) ! blockidx%x threadidx%y
    246, !$acc do vector(16) ! threadidx%x
         Cached references to size [16x8] block of 'ahtu'
         Cached references to size [16x8] block of 'e1u'
         Cached references to size [16x8] block of 'e1v'
         Cached references to size [16x8] block of 'ahtv'
         Cached references to size [16x8] block of 'e2v'
         Cached references to size [17x9x4] block of 'zdk1t'
         Cached references to size [17x9x4] block of 'zdkt'
         CC 1.3 : 32 registers; 15256 shared, 1172 constant, 184 local memory bytes; 50% occupancy
         CC 2.0 : 32 registers; 15240 shared, 1176 constant, 0 local memory bytes; 66% occupancy
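For context, the structure around this loop is roughly as sketched below. The array lists are abbreviated (the real copyin list is longer) and the file name in the compile line is just a placeholder, but this is the general shape of the data region and the flags I'm building with:

! Rough sketch of the enclosing data/compute regions (array lists abbreviated):
!$acc data region copyin( e1u, e2u, e1v, e2v, uslp, vslp, tmask, umask, vmask ) local( zftu, zftv )
   !$acc region
   DO jk = 1, jpkm1
      DO jj = 1, jpjm1
         DO ji = 1, jpim1
            ! ... loop body as shown above ...
         END DO
      END DO
   END DO
   !$acc end region
!$acc end data region

! Compiled with something along the lines of:
!    pgfortran -fast -ta=nvidia,time -Minfo=accel my_routine.F90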
I guess this code is just very heavy on memory accesses. I've played with the loop schedule a bit, along the lines of the sketch below, but without any significant effect. Is there anything else that people can recommend?
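For what it's worth, the kind of schedule experiment I've tried looks roughly like this; the clause values are only examples of what I've tried, and the !$acc do directives sit inside the compute region shown earlier:

! One example of an explicit schedule (clause values are illustrative only):
!$acc do parallel
DO jk = 1, jpkm1
   !$acc do parallel, vector(4)
   DO jj = 1, jpjm1
      !$acc do vector(64)
      DO ji = 1, jpim1
         ! ... loop body as shown above ...
      END DO
   END DO
END DO

(64 x 4 = 256 threads per block, which stays within the 512-thread-per-block limit of the CC 1.3 target.)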
Thanks very much,
Andy.