Error and huge slowdown from !$acc region

Hi there,

I’m new to the PGI Accelerator compilers, and I’m struggling to get the acceleration to work. My machine has two 6-core Intel Xeon X5680 CPUs and a single GeForce GTX 480 GPU, and I’m running a well-known quantum chemistry code written in Fortran 95 that makes extensive use of MPI.

For the application in question, the code spends 80% of its time running the following loop and I’m understandably trying to accelerate it (in the way shown):

!$acc data region copyin(aaa(:,s,set)) copyin(basis_ri(:)) &
!$acc copyin(basis_rj(:)) copyin(basis_rij(:))

<snip>

    value=0.d0
    lmn=0
!$acc region
    do n=0,Nee
     valb=0.d0
     do m=0,NeN
      valc=0.d0
      do l=0,NeN
       lmn=lmn+1
       valc=valc+aaa(lmn,s,set)*basis_ri(l)
      enddo ! l
      valb=valb+valc*basis_rj(m)
     enddo ! m
     value=value+valb*basis_rij(n)
    enddo ! n
!$acc end region

!$acc end data region

I compile this with:

mpif90 -ta=nvidia:cc20,time,cuda4.1 -fast -Minfo=accel -Mprof=time,lines,func

giving

   9486, Generating copyin(basis_rij(:))
         Generating copyin(basis_rj(:))
         Generating copyin(basis_ri(:))
         Generating copyin(aaa(:,s,set))
   9900, Generating compute capability 2.0 binary
   9901, Loop is parallelizable
         Accelerator kernel generated
       9901, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 2.0 : 26 registers; 2052 shared, 176 constant, 0 local memory bytes; 66% occupancy
       9911, Sum reduction generated for value
   9903, Loop is parallelizable
   9905, Inner sequential loop scheduled on accelerator
   9906, Accelerator restriction: multilevel induction variable: lmn

Timings and answers are as follows (using only 1 core of the CPU):

NO ACCELERATION

Total energy (au) = -289.380005052746
Total CPU time : 17.5300

ACCELERATOR DIRECTIVES AS SHOWN

Total energy (au) = 16339512.037687273696 Oops!
Total CPU time : 288.3300 (i.e. 17 times slower…)

ACCELERATOR REGION AROUND THE M LOOP

Total energy (au) = 16339512.037687409669
Total CPU time : 394.3000

ACCELERATOR REGION AROUND THE L LOOP

Total energy (au) = 16339512.037687413394
Total CPU time : 686.0500

The Accelerator timing data for ‘ACCELERATOR DIRECTIVES AS SHOWN ABOVE’ is as follows:

Accelerator Kernel Timing data
    9900: region entered 1395318 times
        time(us): total=61087595 init=108644 region=60978951
                  kernels=21364685 data=0
        w/o init: total=60978951 max=441 min=42 avg=43
        9901: kernel launched 1395318 times
            grid: [1]  block: [256]
            time(us): total=14579005 max=80 min=8 avg=10
        9911: kernel launched 1395318 times
            grid: [1]  block: [256]
            time(us): total=6785680 max=20 min=4 avg=4
    9486: region entered 1569865 times
        time(us): total=277645992 init=707668 region=276938324
                  data=33027038
        w/o init: total=276938324 max=727 min=136 avg=176

So the accelerated code (a) gives the wrong answer, and (b) is hugely slower. OK, I admit I’m still at stage 1 of this process, which is ‘stick in the !$acc region statements and see what happens’, without any serious analysis of precisely what it’s doing.

Now that’s fine (because that’s kind of the point of the accelerator model), but I always expected to go into it more deeply in stage 2, where I would optimize things to make them go faster. However, I didn’t expect things to go so badly wrong at the very beginning.

So my question:
(a) is there any obvious reason why this should happen?
(b) does anyone have any tips for accelerating this loop using only accelerator directives?

I’m very grateful in advance for your time.

Cheers,
Django

Hi Django,

I would expect the GPU to differ slightly from the CPU due to the order of operations in the reductions, but not to be as wrong as your results show. Wrong answers are typically caused by data regions where the data is either not synchronised with the host or copied with the wrong bounds. Here, I’d start by removing the data region, and then by copying all of “aaa” instead of just the single slice.
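As a first experiment, that would look something like this (just a sketch, not tested), with the data region removed and the copies attached to the compute region itself:

!$acc region copyin(aaa,basis_ri,basis_rj,basis_rij)
    do n=0,Nee
     ...
    enddo ! n
!$acc end region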

As for the performance, notice in the profile that the kernels run with a grid of [1] and a block of [256]. That is very poor and indicates that you don’t have enough parallelism. What are the sizes of Nee and NeN?

This lack of parallelism coupled with being called 1.4 million times will slow your program down.
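To put numbers on it: a grid of [1] is a single thread block, and a block executes on a single multiprocessor, so on your GTX 480 (15 multiprocessors) 14 of the 15 sit idle, and the one that is busy has at most 256 threads, where a few thousand are needed to hide memory latency.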

The first step will be to increase your parallelism, since currently only the outer loop can be parallelized. The first constraint is the lmn variable: it needs to be changed so that its value is computed from the loop indices instead of being incremented. Next, promote “valb” and “valc” from scalars to arrays (manually privatising them) and break up the loops to expose more parallelism (see the index sketch just below, and the full example further down).
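For the lmn change, the idea is to turn the running counter (a loop-carried dependence) into an index computed directly from (n,m,l), so every iteration is independent. Roughly, given that the m and l loops each run from 0 to NeN:

   ! before: lmn carries a dependence from one iteration to the next
   lmn=lmn+1
   valc=valc+aaa(lmn,s,set)*basis_ri(l)

   ! after: lmn is a pure function of the loop indices
   lmn=n*(NeN+1)*(NeN+1)+m*(NeN+1)+l+1
   valc=valc+aaa(lmn,s,set)*basis_ri(l)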

Next, you need to work on data movement. Notice that the data region was entered 1.5 million times, so the data is getting copied more times than the kernel is executed. Do the arrays change from call to call? If not, can you move the data region earlier in your program and use the “mirror” or “reflected” directives, together with the “update” directive, so the data is only copied a few times?


While this may or may not work for your code, I’m thinking of something like the following:

% cat test2.f90


PROGRAM MAIN

IMPLICIT NONE

    real(8), dimension(:,:,:), allocatable :: aaa, valc
    real(8), dimension(:,:), allocatable :: valb
    real(8), dimension(:), allocatable :: basis_rl, basis_rj, basis_rij
    real(8) :: value
    integer :: lmn,s,set,Nee,NeN, n,m,l, NN,NM

    NN=32
    NM=32
    Nee=32
    NeN=32
    allocate(aaa((Nee+1)*(NeN+1)*(NeN+1),NN,NM)) ! first dim must hold all (n,m,l) triples
    allocate(valc(0:Nee,0:NeN,0:NeN))
    allocate(valb(0:Nee,0:NeN))
    allocate(basis_rl(0:NeN))
    allocate(basis_rj(0:NeN))
    allocate(basis_rij(0:Nee))
    basis_rl=1.2
    basis_rj=1.1
    basis_rij=1.125
    aaa=2.345

!$acc data region copyin(aaa,basis_rl,basis_rj,basis_rij), local(valb,valc)
    do s=1,NN
      do set=1,NM
   value=0.d0
!$acc region
    do n=0,Nee
     do m=0,NeN
      do l=0,NeN
       lmn=(n*(NeN+1)*(NeN+1))+(m*(NeN+1))+l+1
       valc(n,m,l)=aaa(lmn,s,set)*basis_rl(l)
      enddo ! l
     enddo ! m
    enddo ! n
    do n=0,Nee
     do m=0,NeN
      valb(n,m)=0.d0
      do l=0,NeN
         valb(n,m)=valb(n,m)+valc(n,m,l)*basis_rj(m)
      enddo ! l
     enddo ! m
    enddo ! n
    do n=0,Nee
     do m=0,NeN
       value=value+valb(n,m)*basis_rij(n)
     enddo ! m
    enddo ! n
!$acc end region
     print *, value
     enddo ! set
     enddo ! s
!$acc end data region

END PROGRAM
danger3:/tmp/qa% pgf90 test2.f90 -ta=nvidia,time -Minfo -V12.3
main:
     26, Memory set idiom, array assignment replaced by call to pgf90_mset8
     28, Generating local(valc(:,:,:))
         Generating local(valb(:,:))
         Generating copyin(basis_rij(:))
         Generating copyin(basis_rj(:))
         Generating copyin(basis_rl(:))
         Generating copyin(aaa(:,:,:))
     32, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     33, Loop is parallelizable
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         33, !$acc do parallel, vector(4) ! blockidx%y threadidx%y
         34, !$acc do parallel, vector(4) ! blockidx%x threadidx%z
         35, !$acc do vector(16) ! threadidx%x
             Cached references to size [16] block of 'basis_rl'
             CC 1.3 : 14 registers; 248 shared, 32 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 14 registers; 144 shared, 128 constant, 0 local memory bytes; 100% occupancy
     41, Loop is parallelizable
     42, Loop is parallelizable
         Accelerator kernel generated
         41, !$acc do parallel, vector(32) ! blockidx%y threadidx%x
         42, !$acc do parallel, vector(8) ! blockidx%x threadidx%y
             Cached references to size [8] block of 'basis_rj'
             CC 1.3 : 17 registers; 160 shared, 24 constant, 0 local memory bytes; 75% occupancy
             CC 2.0 : 24 registers; 72 shared, 112 constant, 0 local memory bytes; 83% occupancy
     44, Complex loop carried dependence of 'valb' prevents parallelization
         Loop carried reuse of 'valb' prevents parallelization
         Inner sequential loop scheduled on accelerator
     49, Loop is parallelizable
     50, Loop is parallelizable
         Accelerator kernel generated
         49, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Cached references to size [16] block of 'basis_rij'
         50, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
             CC 1.3 : 10 registers; 2120 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 14 registers; 2056 shared, 88 constant, 0 local memory bytes; 100% occupancy
         51, Sum reduction generated for value
% a.out
... cut
    125144.3227370447
    125144.3227370447
    125144.3227370447

Accelerator Kernel Timing data
/tmp/qa/test2.f90
  main
    32: region entered 1024 times
        time(us): total=99877 init=95 region=99782
                  kernels=50358 data=0
        w/o init: total=99782 max=452 min=95 avg=97
        35: kernel launched 1024 times
            grid: [9x9]  block: [16x4x4]
            time(us): total=23505 max=29 min=22 avg=22
        42: kernel launched 1024 times
            grid: [5x2]  block: [32x8]
            time(us): total=11569 max=14 min=11 avg=11
        50: kernel launched 1024 times
            grid: [3x3]  block: [16x16]
            time(us): total=9280 max=12 min=8 avg=9
        51: kernel launched 1024 times
            grid: [1]  block: [256]
            time(us): total=6004 max=12 min=5 avg=5
/tmp/qa/test2.f90
  main
    28: region entered 1 time
        time(us): total=1576166 init=1398893 region=177273
                  data=47919
        w/o init: total=177273 max=177273 min=177273 avg=177273

Hope this helps,
Mat

Hi Mat,

Thanks very much for your detailed and helpful analysis - it is much appreciated.

I’ll get back to you this evening with a full report on your suggestions (apologies for the slight delay - it’s Saturday and I had promised to take my kid to the beach…).

Cheers,
Django

OK - I’ve been following Mat’s advice. Let me report on progress. First, I should present my original code again, since my first post didn’t include all the outer loops and so on. Here it is, with values of the relevant variables set explicitly to match my test case (note this is just a toy fragment which reproduces the essentials).

MODULE example

INTEGER,PARAMETER :: dp=kind(1.d0)
REAL(dp),ALLOCATABLE :: aaa(:,:,:)
REAL(dp),ALLOCATABLE :: basis_ri(:),basis_rj(:),basis_rij(:)

CONTAINS

 SUBROUTINE blah 
! Multiple subroutines call compute_f from within this module in the real code.
! Also blah and other module subroutines will themselves be called multiple times from outside
! the module.

! This array never modified - fixed coefficients.
  allocate(aaa(27,3,2)) ; aaa=1.123123123d0
! These arrays modified inside every call to compute_f
  allocate(basis_ri(0:2),basis_rj(0:2),basis_rij(0:2))

 netot=201 ; nitot=51

 do i=1,netot
  do ion=1,nitot
   set=1 ! some number between 1 and 2
   do j=1,netot
    call compute_f(value,set)
   enddo
  enddo
 enddo

 write(6,*)'Value = ',value
 return

 END SUBROUTINE blah


 SUBROUTINE compute_f(value,set)
 INTEGER,INTENT(in) :: set 
 REAL(dp),INTENT(out) :: value
 INTEGER Nee,NeN,lmn
 REAL(dp) valb,valc

  s=1 ! some number between 1 and 3
  Nee=2 ; NeN=2 ! always 2 here (could be something else in principle)
  basis_ri=0.234234234d0
  basis_rj=0.345345345d0
  basis_rij=0.456456456d0

!$acc data region copyin(aaa(:,s,set)) copyin(basis_ri(:)) &
!$acc copyin(basis_rj(:)) copyin(basis_rij(:))

  value=0.d0
  lmn=0
!$acc region
  do n=0,Nee
   valb=0.d0
   do m=0,NeN
    valc=0.d0
    do l=0,NeN
     lmn=lmn+1
     valc=valc+aaa(lmn,s,set)*basis_ri(l)
    enddo ! l
    valb=valb+valc*basis_rj(m)
   enddo ! m
   value=value+valb*basis_rij(n)
  enddo ! n
!$acc end region

!$acc end data region

 END SUBROUTINE compute_f

END MODULE example

I have now rewritten this along the lines Mat suggested. Thus:

MODULE example

INTEGER,PARAMETER :: dp=kind(1.d0)
REAL(dp),ALLOCATABLE :: aaa(:,:,:)
REAL(dp),ALLOCATABLE :: basis_ri(:),basis_rj(:),basis_rij(:)
REAL(dp),ALLOCATABLE :: valb(:,:),valc(:,:,:)

CONTAINS

 SUBROUTINE blah ! multiple subroutines call compute_f in this module

! This array never modified - fixed coefficients.
  allocate(aaa(27,3,2)) ; aaa=1.123123123d0
! These arrays modified inside every call to compute_f
  allocate(basis_ri(0:2),basis_rj(0:2),basis_rij(0:2))

 allocate(valb(0:2,0:2),valc(0:2,0:2,0:2))

 netot=201 ; nitot=51

 do i=1,netot ! THIS LOOP ACTUALLY IN ANOTHER MODULE IN THE REAL CODE-
                     ! CONTAINS TOO MUCH STUFF TO BE INSIDE DATA REGION

 !$acc data region copyin(aaa)

 do ion=1,nitot
   set=1 ! some number between 1 and 2
   do j=1,netot
    call compute_f(value,set)
   enddo
  enddo

!$acc end data region

 enddo ! THIS LOOP IN ANOTHER MODULE

 write(6,*)'Value = ',value
 return

 END SUBROUTINE blah


 SUBROUTINE compute_f(value,set)
 INTEGER,INTENT(in) :: set
 REAL(dp),INTENT(out) :: value
 INTEGER Nee,NeN,lmn

  s=1 ! some number between 1 and 3
  Nee=2 ; NeN=2 ! always 2 here (could be something else in principle)
  basis_ri=0.234234234d0
  basis_rj=0.345345345d0
  basis_rij=0.456456456d0

!$acc data region copyin(basis_ri,basis_rj,basis_rij), local(valb,valc)

 value=0.d0
!$acc region
    do n=0,Nee
     do m=0,NeN
      do l=0,NeN
       lmn=(n*(NeN+1)*(NeN+1))+(m*(NeN+1))+l+1
       valc(n,m,l)=aaa(lmn,s,set)*basis_ri(l)
      enddo ! l
     enddo ! m
    enddo ! n
    do n=0,Nee
     do m=0,NeN
      valb(n,m)=0.d0
      do l=0,NeN
       valb(n,m)=valb(n,m)+valc(n,m,l)*basis_rj(m)
      enddo ! l
     enddo ! m
    enddo ! n
    do n=0,Nee
     do m=0,NeN
      value=value+valb(n,m)*basis_rij(n)
     enddo
    enddo ! n
!$acc end region

!$acc end data region

 END SUBROUTINE compute_f

END MODULE example

Here’s the accelerator report (for the real code):

8104, Generating copyin(aaa(:,:,:))
compute_f:
   9502, Generating local(valc(:,:,:))
         Generating local(valb(:,:))
         Generating copyin(basis_rij(:))
         Generating copyin(basis_rj(:))
         Generating copyin(basis_ri(:))
   9915, Generating copyin(aaa(:,s,set))
         Generating compute capability 2.0 binary
   9916, Loop is parallelizable
   9917, Loop is parallelizable
   9918, Loop is parallelizable
         Accelerator kernel generated
       9916, !$acc do parallel, vector(4) ! blockidx%y threadidx%y
       9917, !$acc do parallel, vector(4) ! blockidx%x threadidx%z
       9918, !$acc do vector(16) ! threadidx%x
             Cached references to size [16] block of 'basis_ri'
             CC 2.0 : 20 registers; 144 shared, 160 constant, 0 local memory bytes; 100% occupancy
   9924, Loop is parallelizable
   9925, Loop is parallelizable
         Accelerator kernel generated
       9924, !$acc do parallel, vector(32) ! blockidx%y threadidx%x
       9925, !$acc do parallel, vector(8) ! blockidx%x threadidx%y
             Cached references to size [8] block of 'basis_rj'
             CC 2.0 : 22 registers; 72 shared, 156 constant, 0 local memory bytes; 83% occupancy
   9927, Complex loop carried dependence of 'valb' prevents parallelization
         Loop carried reuse of 'valb' prevents parallelization
         Inner sequential loop scheduled on accelerator
   9932, Loop is parallelizable
   9933, Loop is parallelizable
         Accelerator kernel generated
       9932, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Cached references to size [16] block of 'basis_rij'
       9933, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
             CC 2.0 : 16 registers; 2056 shared, 112 constant, 0 local memory bytes; 100% occupancy
       9934, Sum reduction generated for value_f

Results are as follows:

MY ORIGINAL CODE (1 core of CPU, no GPU)

Total energy -289.380005052746
CPU time : 17.11 sec

MAT’S SUGGESTED REWRITING OF THE FORTRAN (1 core of CPU, no GPU)

Total energy -289.380005052746
CPU time : 22.42 sec

The big slowdown here is undesirable: our code runs on everything from big supercomputers with tens of thousands of CPU cores to poor-quality laptops, with or without GPUs, and it would be nice to use the same Fortran in all cases, supplemented by accelerator directives where necessary. I appreciate this is a difficult thing to achieve. Clearly, if Mat’s version proves to be better for the GPU case, then the slowdown is big enough that the original Fortran will have to be used in the non-GPU case, even though having two versions of the code is messy and hard to maintain (and how do you switch between them at run time in a portable way anyway?).
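One option I can think of (untested, and the flag name here is just an illustration) is to keep both versions of the loop nest and select between them with an ordinary logical at run time; the accelerator region is only entered if its branch is taken:

 logical :: use_gpu ! hypothetical flag, read from an input file or the environment at startup
 ...
 if (use_gpu) then
!$acc region
   ! rewritten loop nest with accelerator directives
!$acc end region
 else
   ! original loop nest
 endif

I gather PGI can also build a unified host+GPU binary from a single source with -ta=nvidia,host and select at run time via the ACC_DEVICE environment variable, but that only helps if the same Fortran suits both targets.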

ADD FULL SET OF ACCELERATOR DIRECTIVES (1 core of CPU with GPU)

HERE aaa IS INCLUDED IN THE SAME data region AS basis_ri, basis_rj etc. i.e. NOT AS IN THE ABOVE CODE FRAGMENT.

Total energy -289.380005052746 (correct answer, this time…)
CPU time : 710.01 sec (41 times slower!)

Accelerator Kernel Timing data
  compute_f
    9915: region entered 2139814 times
        time(us): total=151192460 init=194964 region=150997496
                  kernels=58107115 data=0
        w/o init: total=150997496 max=470 min=68 avg=70
        9918: kernel launched 2139814 times
            grid: [1]  block: [16x4x4]
            time(us): total=19205748 max=77 min=6 avg=8
        9925: kernel launched 2139814 times
            grid: [1]  block: [32x8]
            time(us): total=14996218 max=78 min=7 avg=7
        9933: kernel launched 2139814 times
            grid: [1]  block: [16x16]
            time(us): total=14770591 max=78 min=6 avg=6
        9934: kernel launched 2139814 times
            grid: [1]  block: [256]
            time(us): total=9134558 max=76 min=4 avg=4
  compute_f
    9502: region entered 2340253 times
        time(us): total=696186744 init=1040667 region=695146077
                  data=51373056
        w/o init: total=695146077 max=757 min=227 avg=297

ADD FULL SET OF ACCELERATOR DIRECTIVES (1 core of CPU with GPU)

HERE aaa HAS ITS OWN data region OUTSIDE THE nitot LOOP, AS IN THE ABOVE CODE FRAGMENT.

Total energy -289.380005052746 (correct)
CPU time : 510.06 sec (30 times slower)

Accelerator Kernel Timing data
  compute_f
    9915: region entered 2139814 times
        time(us): total=179968284 init=155411 region=179812873
                  kernels=56122144 data=12882965
        w/o init: total=179812873 max=489 min=82 avg=84
        9918: kernel launched 2139814 times
            grid: [1]  block: [16x4x4]
            time(us): total=18756495 max=77 min=6 avg=8
        9925: kernel launched 2139814 times
            grid: [1]  block: [32x8]
            time(us): total=15001354 max=402 min=7 avg=7
        9933: kernel launched 2139814 times
            grid: [1]  block: [16x16]
            time(us): total=13319290 max=368 min=6 avg=6
        9934: kernel launched 2139814 times
            grid: [1]  block: [256]
            time(us): total=9045005 max=394 min=4 avg=4
    7415: region entered 4557 times
        time(us): total=3884 init=513 region=3371
                  data=32203
        w/o init: total=3371 max=257 min=0 avg=0
  compute_f
    9502: region entered 2340253 times
        time(us): total=495579249 init=186635 region=495392614
                  data=37623734
        w/o init: total=495392614 max=713 min=132 avg=211

So - we are at least getting the right answer with the GPU now, which is obviously a big improvement. However the timing data are clearly unacceptable. For this application we have an obvious rate limiting step taking up 80% of the CPU time, with a nice set of loops that are apparently parallelizable, yet we’re just making the whole thing 30 times slower than on a single CPU core. Hmmmm…

I guess the inner loops are considerably shorter than Mat was guessing, which probably has an effect on the efficiency. Also, I note that there is a:

9915, Generating copyin(aaa(:,s,set))

on encountering the !$acc region statement. Not entirely sure why…

Does anyone have any further ideas of how we can improve the speed here?

Thanks again - Mat - for your help with this. Much appreciated.

Django

Hi Django,

9915, Generating copyin(aaa(:,s,set))
on encountering the !$acc region statement. Not entirely sure why…

While you put “aaa” in a data region, there isn’t a corresponding “reflected” directive in the “compute_f” subroutine, so the compiler doesn’t know that “aaa” has already been copied over. To fix this, you either need to pass “aaa” into compute_f as a dummy argument and add a “reflected” directive, or use “mirror” instead. The “mirror” directive can be applied to a module allocatable array and creates an implicit data region with the same lifetime and scope as the variable. I would recommend using “mirror” here. Below is the updated code; only the directives have changed. Note that “mirror” could also be used on your basis arrays.

MODULE example

INTEGER,PARAMETER :: dp=kind(1.d0)
REAL(dp),ALLOCATABLE :: aaa(:,:,:)
REAL(dp),ALLOCATABLE :: basis_ri(:),basis_rj(:),basis_rij(:)
REAL(dp),ALLOCATABLE :: valb(:,:),valc(:,:,:)
!$acc mirror(aaa)

CONTAINS

 SUBROUTINE blah ! multiple subroutines call compute_f in this module

! This array never modified - fixed coefficients.
  allocate(aaa(27,3,2)) ; aaa=1.123123123d0
 !$acc update device(aaa)

! These arrays modified inside every call to compute_f
  allocate(basis_ri(0:2),basis_rj(0:2),basis_rij(0:2))

 allocate(valb(0:2,0:2),valc(0:2,0:2,0:2))

 netot=201 ; nitot=51

 do i=1,netot ! THIS LOOP ACTUALLY IN ANOTHER MODULE IN THE REAL CODE-
                     ! CONTAINS TOO MUCH STUFF TO BE INSIDE DATA REGION

!!! update aaa again if it changes in this loop.

 do ion=1,nitot
   set=1 ! some number between 1 and 2
   do j=1,netot
    call compute_f(value,set)
   enddo
  enddo

 enddo ! THIS LOOP IN ANOTHER MODULE

 write(6,*)'Value = ',value
 return

 END SUBROUTINE blah


 SUBROUTINE compute_f(value,set)
 INTEGER,INTENT(in) :: set
 REAL(dp),INTENT(out) :: value
 INTEGER Nee,NeN,lmn

  s=1 ! some number between 1 and 3
  Nee=2 ; NeN=2 ! always 2 here (could be something else in principle)
  basis_ri=0.234234234d0
  basis_rj=0.345345345d0
  basis_rij=0.456456456d0

!$acc data region copyin(basis_ri,basis_rj,basis_rij), local(valb,valc)

 value=0.d0
!$acc region
    do n=0,Nee
     do m=0,NeN
      do l=0,NeN
       lmn=(n*(NeN+1)*(NeN+1))+(m*(NeN+1))+l+1
       valc(n,m,l)=aaa(lmn,s,set)*basis_ri(l)
      enddo ! l
     enddo ! m
    enddo ! n
    do n=0,Nee
     do m=0,NeN
      valb(n,m)=0.d0
      do l=0,NeN
       valb(n,m)=valb(n,m)+valc(n,m,l)*basis_rj(m)
      enddo ! l
     enddo ! m
    enddo ! n
    do n=0,Nee
     do m=0,NeN
      value=value+valb(n,m)*basis_rij(n)
     enddo
    enddo ! n
!$acc end region

!$acc end data region

 END SUBROUTINE compute_f

END MODULE example
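For completeness, the “reflected” alternative would look roughly like this (a sketch, assuming I remember the syntax right): the array becomes a dummy argument with a “reflected” directive in the callee, and the data region stays at the call site:

 SUBROUTINE compute_f(aaa,value,set)
 REAL(dp) :: aaa(:,:,:)
!$acc reflected(aaa)
 ...

with, in the caller:

!$acc data region copyin(aaa)
    call compute_f(aaa,value,set)
!$acc end data region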

For the time being, let’s assume that you can get the data transfer time down to zero. Is this code worth accelerating? If this is the only portion of the code that will get accelerated, then the answer is no.

An accelerator really needs thousands of threads to see a speed-up. Here, the best you’re getting is 27 (with Nee=NeN=2, each of the three loops runs from 0 to 2, so 3x3x3 iterations). Worse, the kernel gets called 2.1 million times, so the overhead of launching work on the device dominates. Ideally, you’d invert this, so that a kernel with 2.1 million threads is launched 27 times.
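Your own profile shows this directly: region 9915 was entered 2,139,814 times at an average of about 70 us per entry, and 2,139,814 x 70 us is roughly 150 seconds, which accounts for essentially all of the total=151192460 us reported, and dwarfs the 17 seconds the whole calculation takes on one CPU core.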

Is there any way you can push the accelerator compute region up higher? If you can inline “compute_f”, either automatically by the compiler or manually, then the “netot” and “nitot” loops, with their much higher trip counts, would be much better candidates. See the sketch below.
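Very roughly, I mean something like this (untested; it glosses over the fact that the basis arrays and “s”/“set” change per call in the real code, so they would have to be precomputed or computed inside the region, and it makes “value” an array purely for illustration):

!$acc region
 do ion=1,nitot
  do j=1,netot
   ! inlined body of compute_f: parallelism now comes from ion and j
   val=0.d0
   do n=0,Nee
    do m=0,NeN
     do l=0,NeN
      lmn=n*(NeN+1)*(NeN+1)+m*(NeN+1)+l+1
      val=val+aaa(lmn,s,set)*basis_ri(l)*basis_rj(m)*basis_rij(n)
     enddo ! l
    enddo ! m
   enddo ! n
   value(ion,j)=val
  enddo ! j
 enddo ! ion
!$acc end region

(The nested accumulation in your original collapses into the single triple sum above, because basis_rj(m) and basis_rij(n) simply scale the inner partial sums.)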

  • Mat