Hi there,

I’m new to the PGI Accelerator compilers, and I’m struggling to get the acceleration to work. My machine has two 6-core Intel Xeon X5680 CPUs and a single GeForce GTX 480 GPU, and I’m running a well-known quantum chemistry code written in Fortran 95 that makes extensive use of MPI.

For the application in question, the code spends 80% of its time in the following loop, so that is what I’m trying to accelerate (with the directives shown):

```
!$acc data region copyin(aaa(:,s,set)) copyin(basis_ri(:)) &
!$acc copyin(basis_rj(:)) copyin(basis_rij(:))
<snip>
value=0.d0
lmn=0
!$acc region
do n=0,Nee
  valb=0.d0
  do m=0,NeN
    valc=0.d0
    do l=0,NeN
      lmn=lmn+1
      valc=valc+aaa(lmn,s,set)*basis_ri(l)
    enddo ! l
    valb=valb+valc*basis_rj(m)
  enddo ! m
  value=value+valb*basis_rij(n)
enddo ! n
!$acc end region
!$acc end data region
```

I compile this with:

```
mpif90 -ta=nvidia:cc20,time,cuda4.1 -fast -Minfo=accel -Mprof=time,lines,func
```

giving

```
9486, Generating copyin(basis_rij(:))
      Generating copyin(basis_rj(:))
      Generating copyin(basis_ri(:))
      Generating copyin(aaa(:,s,set))
9900, Generating compute capability 2.0 binary
9901, Loop is parallelizable
      Accelerator kernel generated
      9901, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
            CC 2.0 : 26 registers; 2052 shared, 176 constant, 0 local memory bytes; 66% occupancy
      9911, Sum reduction generated for value
9903, Loop is parallelizable
9905, Inner sequential loop scheduled on accelerator
9906, Accelerator restriction: multilevel induction variable: lmn
```
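One thing that stands out in this feedback is the restriction reported for `lmn` at line 9906. The counter is incremented exactly once per innermost iteration, so its value at any `(n,m,l)` can also be written in closed form from the loop indices, which removes the dependence of each iteration on the previous one. Whether that dependence is behind the wrong answer is only a guess at this point. The standalone CPU-only sketch below just checks that the two indexing schemes agree; the sizes of `Nee` and `NeN`, the test data, and the 1-D array `a` (standing in for `aaa(:,s,set)`) are made up for illustration and are not taken from the real code.

```fortran
program lmn_index_check
  implicit none
  ! Hypothetical sizes and data, chosen only for this illustration
  integer, parameter :: dp = kind(1.d0)
  integer, parameter :: Nee = 3, NeN = 2
  integer, parameter :: ntot = (Nee+1)*(NeN+1)*(NeN+1)
  real(dp) :: a(ntot)                 ! stand-in for aaa(:,s,set)
  real(dp) :: basis_ri(0:NeN), basis_rj(0:NeN), basis_rij(0:Nee)
  real(dp) :: value_counter, value_index, valb, valc
  integer :: l, m, n, lmn

  ! Arbitrary test data
  do lmn = 1, ntot
    a(lmn) = 0.01d0*lmn
  enddo
  do l = 0, NeN
    basis_ri(l) = 1.d0 + 0.5d0*l
    basis_rj(l) = 2.d0 - 0.25d0*l
  enddo
  do n = 0, Nee
    basis_rij(n) = 1.d0 + 0.1d0*n
  enddo

  ! Original form: lmn is a running counter carried across all three loops
  value_counter = 0.d0
  lmn = 0
  do n = 0, Nee
    valb = 0.d0
    do m = 0, NeN
      valc = 0.d0
      do l = 0, NeN
        lmn = lmn + 1
        valc = valc + a(lmn)*basis_ri(l)
      enddo ! l
      valb = valb + valc*basis_rj(m)
    enddo ! m
    value_counter = value_counter + valb*basis_rij(n)
  enddo ! n

  ! Alternative form: lmn computed from the loop indices alone
  value_index = 0.d0
  do n = 0, Nee
    valb = 0.d0
    do m = 0, NeN
      valc = 0.d0
      do l = 0, NeN
        lmn = l + 1 + (NeN+1)*(m + (NeN+1)*n)
        valc = valc + a(lmn)*basis_ri(l)
      enddo ! l
      valb = valb + valc*basis_rj(m)
    enddo ! m
    value_index = value_index + valb*basis_rij(n)
  enddo ! n

  print *, 'running counter:', value_counter
  print *, 'computed index :', value_index
  ! Both loops do the same operations in the same order, so the sums should agree
  if (abs(value_counter - value_index) > 1.d-12*abs(value_counter)) stop 1
end program lmn_index_check
```

This only demonstrates that the index arithmetic is equivalent on the CPU. It has not been tried inside the `!$acc region` of the real code.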

Timings and answers are as follows (using only 1 core of the CPU):

NO ACCELERATION

Total energy (au) = -289.380005052746

Total CPU time : 17.5300

ACCELERATOR DIRECTIVES AS SHOWN

Total energy (au) = 16339512.037687273696 Oops!

Total CPU time : 288.3300 (i.e. about 16 times slower…)

ACCELERATOR REGION AROUND THE M LOOP

Total energy (au) = 16339512.037687409669

Total CPU time : 394.3000

ACCELERATOR REGION AROUND THE L LOOP

Total energy (au) = 16339512.037687413394

Total CPU time : 686.0500

The accelerator timing data for the ‘ACCELERATOR DIRECTIVES AS SHOWN’ case is as follows:

```
Accelerator Kernel Timing data
  9900: region entered 1395318 times
    time(us): total=61087595 init=108644 region=60978951
              kernels=21364685 data=0
    w/o init: total=60978951 max=441 min=42 avg=43
    9901: kernel launched 1395318 times
      grid: [1] block: [256]
      time(us): total=14579005 max=80 min=8 avg=10
    9911: kernel launched 1395318 times
      grid: [1] block: [256]
      time(us): total=6785680 max=20 min=4 avg=4
  9486: region entered 1569865 times
    time(us): total=277645992 init=707668 region=276938324
              data=33027038
    w/o init: total=276938324 max=727 min=136 avg=176
```

So the accelerated code (a) gives the wrong answer, and (b) is hugely slower. OK, I admit I’m still in stage 1 of this process, which is ‘stick in the acc region statements and see what happens’ without any serious analysis of precisely what the compiler is doing.

That’s fine (because that’s kind of the point of the accelerator model), and I always expected to go into it more deeply in stage 2, where I would optimize things to make the code go faster. However, I didn’t expect things to go so badly wrong right at the beginning.

So my questions:

(a) Is there any obvious reason why this should happen?

(b) Does anyone have any tips for accelerating this loop using only accelerator directives?

Many thanks in advance for your time.

Cheers,

Django