Loop optimisation question

Hello, I’ve ported a routine to run on a Tesla card using the PGI accelerator directives. The code uses double precision and I’m seeing a speed-up of approximately a factor of three over a single Nehalem core. Basically I’m wondering whether I can do better. The most expensive bits of the routine look like:

          DO jk = 1, jpkm1
            DO jj = 1 , jpjm1
               DO ji = 1, jpim1 
                  zabe1 = ( fsahtu(ji,jj,jk) + pahtb0 ) * e2u(ji,jj) * fse3u(ji,jj,jk) / e1u(ji,jj)
                  zabe2 = ( fsahtv(ji,jj,jk) + pahtb0 ) * e1v(ji,jj) * fse3v(ji,jj,jk) / e2v(ji,jj)
                  !
                  zmsku = 1.0_wp / MAX(  tmask(ji+1,jj,jk  ) + tmask(ji,jj,jk+1)   &
                     &             + tmask(ji+1,jj,jk+1) + tmask(ji,jj,jk  ), 1.0_wp )
                  !
                  zmskv = 1.0_wp / MAX(  tmask(ji,jj+1,jk  ) + tmask(ji,jj,jk+1)   &
                     &             + tmask(ji,jj+1,jk+1) + tmask(ji,jj,jk  ), 1.0_wp )
                  !
                  zcof1 = - fsahtu(ji,jj,jk) * e2u(ji,jj) * uslp(ji,jj,jk) * zmsku
                  zcof2 = - fsahtv(ji,jj,jk) * e1v(ji,jj) * vslp(ji,jj,jk) * zmskv
                  !
                  zftu(ji,jj,jk ) = (  zabe1 * zdit(ji,jj,jk,jn)   &
                     &              + zcof1 * (  zdkt (ji+1,jj,jk) + zdk1t(ji,jj,jk)      &
                     &                         + zdk1t(ji+1,jj,jk) + zdkt (ji,jj,jk)  )  ) * umask(ji,jj,jk)
                  zftv(ji,jj,jk) = (  zabe2 * zdjt(ji,jj,jk,jn)   &
                     &              + zcof2 * (  zdkt (ji,jj+1,jk) + zdk1t(ji,jj,jk)      &
                     &                         + zdk1t(ji,jj+1,jk) + zdkt (ji,jj,jk)  )  ) * vmask(ji,jj,jk)
               END DO
            END DO
         END DO

where jpkm1=30, jpjm1=148 and jpim1=181. The loop is within a data region and I’m using the ‘time’ option to -ta=nvidia to get data on how long each kernel is taking. The compiler output for this nested loop (which begins at line 246 in my source) is:

    246, Loop is parallelizable
         Accelerator kernel generated
        240, !$acc do parallel, vector(4) ! blockidx%y threadidx%z
        244, !$acc do parallel, vector(8) ! blockidx%x threadidx%y
        246, !$acc do vector(16) ! threadidx%x
             Cached references to size [16x8] block of 'ahtu'
             Cached references to size [16x8] block of 'e1u'
             Cached references to size [16x8] block of 'e1v'
             Cached references to size [16x8] block of 'ahtv'
             Cached references to size [16x8] block of 'e2v'
             Cached references to size [17x9x4] block of 'zdk1t'
             Cached references to size [17x9x4] block of 'zdkt'
             CC 1.3 : 32 registers; 15256 shared, 1172 constant, 184 local memory bytes; 50% occupancy
             CC 2.0 : 32 registers; 15240 shared, 1176 constant, 0 local memory bytes; 66% occupancy

I guess this code is just very heavy on its memory accesses. I’ve played with the loop schedule a bit, but without any significant effect. Is there anything else that people can recommend?

Thanks very much,

Andy.

I’m not an expert with the Accelerator pragmas (it’s been a while since I’ve used them), but I know from experience with CUDA Fortran that you usually want the stride-one index to be the outer loop to help with memory access. So I’d order this loop nest as ji, then jj, then jk. However, the pragmas might have already reordered these loops…
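
For what it’s worth, the reordering I have in mind is just swapping the nest order; an untested sketch, with the loop body exactly as in your post:

          ! Untested sketch: stride-one index ji moved to the outermost loop
          DO ji = 1, jpim1
             DO jj = 1, jpjm1
                DO jk = 1, jpkm1
                   ! ... same statements computing zftu and zftv as above ...
                END DO
             END DO
          END DO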

Still, have you tried hand-reordering the loops to see if that helps? (Also, there is the index jn in this sample. Is that being looped over often? Is it static? If not, is there a way to move it inside the ji-jj-jk loops?)

Finally, does your speed-up include data transfer? If so, maybe using pinned memory could help, but sometimes an overall speed-up is simply hamstrung by the cost of moving data on and off the GPUs.
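
If you do experiment with pinned memory, in CUDA Fortran it is just an attribute on an allocatable host array. A rough sketch (the array name and dimensions here are placeholders, not names from your code):

          ! Sketch only: a pinned (page-locked) host array in CUDA Fortran.
          ! 'zbuf' and its bounds are made-up placeholders.
          real(wp), pinned, allocatable :: zbuf(:,:,:)
          integer :: ierr
          allocate( zbuf(jpi,jpj,jpk), stat=ierr )   ! host<->device copies of zbuf avoid an extra staging copy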

Matt

I’ve just tried putting the ji loop outermost but that didn’t help. Thanks for pointing out the jn index. That is for a very short loop (trip count of 2 in the test case I have) which I’ve currently got outside of the !$ACC REGION as it was causing problems for the compiler.
I could re-write and make that the innermost loop, possibly unrolling it completely.
I can also remove the jn index from the loop in question completely as the arrays here only hold intermediate results. This tidies things up but has negligible effect on performance.

No, it doesn’t include data transfer - I’m explicitly keeping that separate by doing a timing loop within a !$ACC DATA REGION. As this is my first foray into this technology I’m currently just looking to get to grips with the code optimisation/restructuring issues. I see your point though - if I can’t do my host<->device data transfers asynchronously in the actual application then there’s no point going to the GPU.
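
In case it’s useful to see, the timing harness is structured roughly like this (a simplified sketch: the data clauses are placeholders rather than the real lists, and the loop nests actually live inside the ported routine):

          ! Simplified sketch of the timing harness; data clauses are placeholders
          !$acc data region copyin( e1u, e2u, e1v, e2v, uslp, vslp, tmask, umask, vmask ) local( zftu, zftv )
          CALL system_clock( count=ic0 )
          DO it = 1, 100                     ! timed iterations; host<->device transfers stay outside
             !$acc region
             ! ... the loop nests shown above ...
             !$acc end region
          END DO
          CALL system_clock( count=ic1 )
          !$acc end data region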

Thanks for your tips.

Andy.

Hi Andy,

From the Minfo output, it appears to me that the compiler is doing a good job both in scheduling the kernel and in utilizing the shared cache. However, you do use several divides, and these may be causing the slowdown.

By default, the compiler uses a slower but more precise divide. Try the “-ta=nvidia,fastmath” flag to use the faster version. Of course, check your answers. Just as an FYI, on the Fermi cards the performance of the precise divide improved, so it’s no longer slower than the imprecise divide.

Also, can you post your profile (i.e. -ta=nvidia,time) output? I’d like to see if you’re running into any initialization costs. On Linux, the OS will power down your devices, causing a ~1 second per-device cost the first time they are accessed. You can eliminate this cost by running the PGI utility “pgcudainit” in the background so the devices are held open, or mask it by calling “acc_init” before any internal timers.
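
The acc_init call is just a one-liner near the top of the program; a minimal sketch, assuming the runtime routines come from the accel_lib module:

          program warmup
             use accel_lib                       ! PGI Accelerator runtime module
             implicit none
             ! Pay the ~1 second device warm-up cost here, before any internal timers start
             call acc_init( acc_device_nvidia )
             ! ... rest of the program, including the timed accelerator regions ...
          end program warmup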

In addition to the warm-up cost, there is some overhead when loading the kernels onto the device, though this will only affect very short runs, since longer runs amortize the cost.

Hope this helps,
Mat

Hi Mat,

By default, the compiler uses a slower but more precise divide. Try the “-ta=nvidia,fastmath” flag to use the faster version. Of course, check your answers. Just as an FYI, on the Fermi cards the performance of the precise divide improved, so it’s no longer slower than the imprecise divide.

I got excited there, but on trying it I see no improvement at all (so much so that I’ve checked quite carefully that I’m building and running the binary I think I am, and it all seems right).

Also, can you post your profile (i.e. -ta=nvidia,time) output? I’d like to see if you’re running into any initialization costs. On Linux, the OS will power down your devices, causing a ~1 second per-device cost the first time they are accessed. You can eliminate this cost by running the PGI utility “pgcudainit” in the background so the devices are held open, or mask it by calling “acc_init” before any internal timers.

The initialization shouldn’t be a problem: I make one call to my routine (on the GPU) and check the answers, then (in the same program) I call it again, have it do 100 iterations, and time that. The full profile (the loop I’ve posted is the kernel at 248) looks like:

Accelerator Kernel Timing data
/data1/arp28/gnemo/nemo_local/NEMOGCM/EXTERNAL/GPU/traldf_iso_harness.F90
  kernel
    172: region entered 101 times
        time(us): total=3339063 init=31 region=3339032
                  kernels=3189240 data=46445
        w/o init: total=3339032 max=47462 min=32572 avg=33059
        186: kernel launched 202 times
            grid: [19x4]  block: [8x8]
            time(us): total=4072 max=35 min=19 avg=20
        198: kernel launched 202 times
            grid: [19x15]  block: [16x8x2]
            time(us): total=161511 max=846 min=794 avg=799
        207: kernel launched 202 times
            grid: [12x10]  block: [16x16]
            time(us): total=8527 max=45 min=41 avg=42
        226: kernel launched 202 times
            grid: [10x32]  block: [16x16]
            time(us): total=163804 max=821 min=800 avg=810
        235: kernel launched 202 times
            grid: [12x10]  block: [16x16]
            time(us): total=7267 max=37 min=35 avg=35
        248: kernel launched 202 times
            grid: [37x8]  block: [16x4x4]
            time(us): total=1382742 max=6863 min=6830 avg=6845
        280: kernel launched 202 times
            grid: [147x2]  block: [16x16]
            time(us): total=305232 max=1523 min=1495 avg=1511
        334: kernel launched 202 times
            grid: [4x19]  block: [8x8]
            time(us): total=4115 max=21 min=20 avg=20
        346: kernel launched 202 times
            grid: [12x10]  block: [16x16]
            time(us): total=4383 max=22 min=21 avg=21
        357: kernel launched 202 times
            grid: [10x32]  block: [16x16]
            time(us): total=894902 max=4481 min=4399 avg=4430
        381: kernel launched 202 times
            grid: [147x2]  block: [16x16]
            time(us): total=252685 max=1256 min=1248 avg=1250
traldf_iso_harness.F90
  kernel
    165: region entered 2 times
        time(us): total=7181055 init=3768652 region=3412403
                  data=61406
        w/o init: total=3412403 max=3341430 min=70973 avg=1706201

Following Michael Wolfe’s article on WRF, I’ve created a version of this benchmark that consists of a single nested loop. However, it turns out to be about 60% slower than the one I’ve posted here.

Thanks for your help,

Andy.

Hi Andy,

I’m not sure then. Is the code something you could share with us? If so, please send it to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

Thanks,
Mat