Accelerator parallelization issues

I am new to parallel programming. I have a legacy FORTRAN-77
code (about 15,000 lines) that models eclipsing binary stars.
The stars are divided up into rectangular grids in “longitude”
and “latitude”, and much of the computing time is spent
looping over the grid(s) and computing various things. Since the
GPUs don’t support subroutine and function calls inside accelerator
regions, I have started manually inlining the subroutine calls.

This section of the code computes various physical quantities for
each pixel on the star, and saves the results into several arrays.
Each pixel is independent of every other pixel, so I would think
loops like these should be able to run in parallel. At
each pixel, there is a Newton-Raphson iteration to find the radius
vector.

!$acc region
           do 1104 ialf=1,Nalph
             r=0.0000001d0
             theta=-0.5d0*dtheta+dtheta*dble(ialf)
             dphi=twopie/dble(4*Nbet)      !dble(ibetlim(ialf))
             snth=dsin(theta)
             snth3=dsin(theta)/3.0d0
             cnth=dcos(theta)
             DO 1105 ibet=1,4*Nbet
               iidx=mmdx(ialf,ibet)
               phi=-0.5d0*dphi+dphi*dble(ibet)
               phi=phi+phistart(ialf)
               cox=dcos(phi)*snth
               coy=dsin(phi)*snth
               coz=cnth
c begin in-line subroutine rad
               do irad=1,190            !Newton-Raphson iteration
                 x=r*cox
                 y=r*coy
                 z=r*coz
                 if(itide.lt.2)then     !in-line subroutine spherepot
                   t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
                   t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
                   psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
                   dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %               -cox/(bdist*bdist))+t2*2.0d0*r
                 endif                  !end spherepot
                 rnew=r-(psi-psi0)/dpsidr
                 dr=dabs(rnew-r)
                 if(dabs(dr).lt.1.0d-19)go to 4115
                 r=rnew
               enddo
 4115          continue
               if(itide.lt.2)then       !in-line subroutine poten
                 RST = DSQRT(X**2 + Y**2 + Z**2)
                 RX = DSQRT((X-bdist)**2 + Y**2 + Z**2)
                 A = ((1.0d0+overQ)/2.0d0) * OMEGA**2
                 RST3 = RST*RST*RST
                 RX3 = RX*RX*RX
                 PSI = 1.0d0/RST + overQ/RX - overQ*X/bdist/bdist
     &             + A*(X**2 + Y**2)
                 PSIY = -Y/RST3 - overQ*Y/RX3 + 2.0d0*A*Y
                 PSIZ = -Z/RST3 - overQ*Z/RX3
                 PSIX = -X/RST3 - overQ*(X-bdist)/RX3 - overQ/bdist/bdist
     &             + 2.0d0*A*X
                 RST5 = RST3*RST*RST
                 RX5 = RX3*RX*RX
                 PSIXX = -1.0d0/RST3 + 3.0d0*X**2/RST5
     $             - overQ/RX3 + (3.0d0*overQ*(X-bdist)**2)/RX5 + 2.0d0*A
               endif                    !end poten   !end rad
               radarray(iidx) = R
               garray(iidx) = DSQRT(PSIX**2+PSIY**2+PSIZ**2)
               oneoverg=1.0d0/garray(iidx)
               GRADX(iidx) = -PSIX*oneoverg
               GRADY(iidx) = -PSIY*oneoverg
               GRADZ(iidx) = -PSIZ*oneoverg
               surf(iidx) = COX*GRADX(iidx)+COY*GRADY(iidx)
     $           + COZ*GRADZ(iidx)
               if(surf(iidx).lt.0.7d0)surf(iidx)=0.7d0
               surf(iidx) = R**2 / surf(iidx)
               surf(iidx)=surf(iidx)*dphi*dtheta*snth
               xarray(iidx)=x
               yarray(iidx)=y
               zarray(iidx)=z
               sarea=sarea+surf(iidx)
               VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
 1105        CONTINUE                   ! continue ibet loop
 1104      CONTINUE                     ! continue over ialf
!$acc end region
!$acc end data region

I compile with this command:
pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia -Minfo

I get a lengthy list of messages:
5035, No parallel kernels found, accelerator region ignored
5046, Accelerator restriction: induction variable live-out from loop: ialf
5051, Loop carried scalar dependence for ‘rad1_psi’ at line 5123
Loop carried scalar dependence for ‘rad1_dpsidr’ at line 5123
Loop carried scalar dependence for ‘rad1_r’ at line 5066
Loop carried scalar dependence for ‘rad1_r’ at line 5068
Loop carried scalar dependence for ‘rad1_r’ at line 5070
Loop carried scalar dependence for ‘rad1_r’ at line 5123
Loop carried scalar dependence for ‘rad1_r’ at line 5124
5061, Loop carried scalar dependence for ‘rad1_psi’ at line 5123
Loop carried scalar dependence for ‘rad1_dpsidr’ at line 5123
Loop carried scalar dependence for ‘rad1_r’ at line 5066
Loop carried scalar dependence for ‘rad1_r’ at line 5068
Loop carried scalar dependence for ‘rad1_r’ at line 5070
Loop carried scalar dependence for ‘rad1_r’ at line 5123
Loop carried scalar dependence for ‘rad1_r’ at line 5124
5131, Accelerator restriction: induction variable live-out from loop: ialf
5187, No parallel kernels found, accelerator region ignored
5188, Loop carried scalar dependence for ‘psi’ at line 5261
Loop carried scalar dependence for ‘psiy’ at line 5342
Loop carried scalar dependence for ‘psiy’ at line 5345
Loop carried scalar dependence for ‘psiz’ at line 5342
Loop carried scalar dependence for ‘psiz’ at line 5346
Loop carried scalar dependence for ‘psix’ at line 5342
Loop carried scalar dependence for ‘psix’ at line 5344
Complex loop carried dependence of ‘radarray’ prevents parallelization
Complex loop carried dependence of ‘garray’ prevents parallelization
Complex loop carried dependence of ‘gradx’ prevents parallelization
Complex loop carried dependence of ‘grady’ prevents parallelization
Complex loop carried dependence of ‘gradz’ prevents parallelization
Complex loop carried dependence of ‘surf’ prevents parallelization
Complex loop carried dependence of ‘xarray’ prevents parallelization
Complex loop carried dependence of ‘yarray’ prevents parallelization
Complex loop carried dependence of ‘zarray’ prevents parallelization
Scalar last value needed after loop for ‘z’ at line 5475
Loop carried scalar dependence for ‘dpsidr’ at line 5261
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x
Accelerator restriction: scalar variable live-out from loop: psixx
Accelerator restriction: scalar variable live-out from loop: psix
Accelerator restriction: scalar variable live-out from loop: psiz
Accelerator restriction: scalar variable live-out from loop: psiy
Accelerator restriction: scalar variable live-out from loop: coz
Accelerator restriction: scalar variable live-out from loop: coy
Accelerator restriction: scalar variable live-out from loop: cox
5190, Accelerator restriction: induction variable live-out from loop: ialf
5195, Loop carried scalar dependence for ‘psi’ at line 5261
Loop carried scalar dependence for ‘psiy’ at line 5342
Loop carried scalar dependence for ‘psiy’ at line 5345
Loop carried scalar dependence for ‘psiz’ at line 5342
Loop carried scalar dependence for ‘psiz’ at line 5346
Loop carried scalar dependence for ‘psix’ at line 5342
Loop carried scalar dependence for ‘psix’ at line 5344
Complex loop carried dependence of ‘radarray’ prevents parallelization
Complex loop carried dependence of ‘garray’ prevents parallelization
Loop carried dependence due to exposed use of ‘garray(:)’ prevents parallelization
Complex loop carried dependence of ‘gradx’ prevents parallelization
Loop carried dependence due to exposed use of ‘gradx(:)’ prevents parallelization
Complex loop carried dependence of ‘grady’ prevents parallelization
Loop carried dependence due to exposed use of ‘grady(:)’ prevents parallelization
Complex loop carried dependence of ‘gradz’ prevents parallelization
Loop carried dependence due to exposed use of ‘gradz(:)’ prevents parallelization
Complex loop carried dependence of ‘surf’ prevents parallelization
Loop carried dependence due to exposed use of ‘surf(:)’ prevents parallelization
Complex loop carried dependence of ‘xarray’ prevents parallelization
Complex loop carried dependence of ‘yarray’ prevents parallelization
Complex loop carried dependence of ‘zarray’ prevents parallelization
Scalar last value needed after loop for ‘z’ at line 5475
Loop carried scalar dependence for ‘dpsidr’ at line 5261
Loop carried scalar dependence for ‘r’ at line 5210
Loop carried scalar dependence for ‘r’ at line 5211
Loop carried scalar dependence for ‘r’ at line 5212
Loop carried scalar dependence for ‘r’ at line 5214
Loop carried scalar dependence for ‘r’ at line 5216
Loop carried scalar dependence for ‘r’ at line 5217
Loop carried scalar dependence for ‘r’ at line 5261
Loop carried scalar dependence for ‘r’ at line 5262
Loop carried scalar dependence for ‘r’ at line 5341
Loop carried scalar dependence for ‘r’ at line 5357
Loop carried scalar dependence for ‘r’ at line 5375
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x
Accelerator restriction: scalar variable live-out from loop: psixx
Accelerator restriction: scalar variable live-out from loop: psix
Accelerator restriction: scalar variable live-out from loop: psiz
Accelerator restriction: scalar variable live-out from loop: psiy
Accelerator restriction: scalar variable live-out from loop: coz
Accelerator restriction: scalar variable live-out from loop: coy
Accelerator restriction: scalar variable live-out from loop: cox
Parallelization would require privatization of array ‘zarray(:)’
Parallelization would require privatization of array ‘yarray(:)’
Parallelization would require privatization of array ‘xarray(:)’
Parallelization would require privatization of array ‘radarray(:)’
Invariant if transformation
5196, Accelerator restriction: induction variable live-out from loop: ialf
5209, Scalar last value needed after loop for ‘x’ at line 5268
Scalar last value needed after loop for ‘x’ at line 5269
Scalar last value needed after loop for ‘x’ at line 5273
Scalar last value needed after loop for ‘x’ at line 5277
Scalar last value needed after loop for ‘x’ at line 5281
Scalar last value needed after loop for ‘x’ at line 5349
Scalar last value needed after loop for ‘y’ at line 5268
Scalar last value needed after loop for ‘y’ at line 5269
Scalar last value needed after loop for ‘y’ at line 5273
Scalar last value needed after loop for ‘y’ at line 5275
Scalar last value needed after loop for ‘y’ at line 5350
Scalar last value needed after loop for ‘z’ at line 5268
Scalar last value needed after loop for ‘z’ at line 5269
Scalar last value needed after loop for ‘z’ at line 5276
Scalar last value needed after loop for ‘z’ at line 5351
Scalar last value needed after loop for ‘z’ at line 5475
Loop carried scalar dependence for ‘psi’ at line 5261
Loop carried scalar dependence for ‘dpsidr’ at line 5261
Loop carried scalar dependence for ‘r’ at line 5210
Loop carried scalar dependence for ‘r’ at line 5211
Loop carried scalar dependence for ‘r’ at line 5212
Loop carried scalar dependence for ‘r’ at line 5214
Loop carried scalar dependence for ‘r’ at line 5216
Loop carried scalar dependence for ‘r’ at line 5217
Loop carried scalar dependence for ‘r’ at line 5261
Loop carried scalar dependence for ‘r’ at line 5262
Scalar last value needed after loop for ‘r’ at line 5341
Scalar last value needed after loop for ‘r’ at line 5357
Scalar last value needed after loop for ‘r’ at line 5375
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x


I suspect that arrays like gradx and grady need to be saved (copied back out). However, I
don’t understand the complex dependencies, like why this error occurs:

Accelerator restriction: induction variable live-out from loop: ialf

The value of the latitude of the pixel (theta) is set from the index.
The longitude (phi) is also set from an index, but the compiler does
not care about that.

In this code, I need to compute values for various pixels and save them in arrays
in a few other places, so I would appreciate any advice on how this can be
done in parallel.

I am using pgfortran 10.3. We have a Linux box with a quadcore i7 and two
Nvidia C1060 cards.

Thanks,

Jerry

P.S. upon previewing, the spacing in the code snippet is messed up.
Hopefully it is still understandable.

Hi Jerry,

Let’s walk through a few of these messages for the code shown.

Loop carried scalar dependence for ‘psiy’ at line 5342

The “PSI” variables are all initialized within an IF statement. In serial code this is fine, since the values from the previous iteration will be used. In parallel, scalars need to be “privatized” so that each thread has its own copy and cannot depend on a previous iteration. To fix this, initialize these variables outside the IF statement or in an ELSE clause.
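For example, something like this just above the if(itide.lt.2) poten block (a sketch only; the zero defaults are placeholders, so use whatever values make sense physically):

c              define the PSI scalars unconditionally so each thread can
c              be given its own private copy
               PSI   = 0.0d0
               PSIX  = 0.0d0
               PSIY  = 0.0d0
               PSIZ  = 0.0d0
               PSIXX = 0.0d0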

Complex loop carried dependence of ‘radarray’ prevents parallelization

For these, the problem is your use of the calculated index “iidx”. At compile time, there is no way to know that the values of “iidx=mmdx(ialf,ibet)” are independent. The compiler must assume the worst case, where all values of mmdx are the same and all threads update the same “iidx”. In that case, the value actually stored would be non-deterministic.

To fix this, you’ll need to create temporary result arrays to store the values for each iteration, and then perform the gather operation serially on the host.
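For example, something along these lines (a sketch; “tmprad” and “tmpg” are hypothetical 2-D work arrays dimensioned (Nalph,4*Nbet), and only two of the output arrays are shown):

c              inside the accelerator region, store results by loop index
               tmprad(ialf,ibet) = R
               tmpg(ialf,ibet)   = DSQRT(PSIX**2+PSIY**2+PSIZ**2)

c          after the region, gather into the 1-D arrays on the host
           do ialf=1,Nalph
             do ibet=1,4*Nbet
               iidx=mmdx(ialf,ibet)
               radarray(iidx) = tmprad(ialf,ibet)
               garray(iidx)   = tmpg(ialf,ibet)
             enddo
           enddo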

Accelerator restriction: scalar variable live-out from loop: r

The compiler will attempt to parallelize all DO loops. However, you’re using variables calculated in the innermost loop in the middle loop. This makes them “live” (the value is used on the right-hand side), and again you could have non-deterministic results depending upon which thread’s value is used.

In this case, you actually want the innermost loop to be sequential. To fix this, add the “kernel” clause to an “!$acc do” directive above the “ibet” loop:

!$acc do kernel
DO 1105 ibet=1, 4*Nbet

This tells the compiler to still parallelize the ibet loop, but everything inside the loop becomes the body of the device kernel and is executed sequentially by each thread.

Hope this helps,
Mat

Hi Mat,

Thanks for the help. I easily fixed two of the problems by removing
the if-statement and adding the !$acc do kernel directive.
The problem with the iidx array counter can be solved by
changing the arrays back to two dimensions. Several years ago
I made most of the 2-dimensional arrays one dimensional to
speed up the performance, and I suppose it is necessary to change
them back if these loops are to be parallelized.

I managed to get a similar loop structure to compile for parallel
execution. However, the code takes much longer to run. The run
time (on the wall clock) is about 17 seconds when the code is
compiled for the host CPU only:

pgfortran -Mextend -O2 -o serialELC ELC.for

The run time goes up to about 35 seconds when compiled for the GPU:

pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,time

Here is the loop in the subroutine:

           Na=120
           Nb=40
           dtheta=pie/dble(Na)
           vol=0.0d0
!$acc region
           do 104 ialf=1,Na
             r=1.0d-12
             psi=1.0d0
             dpsidr=1.0d0
             theta=-0.5d0*dtheta+dtheta*dble(ialf)
             dphi=twopie/dble(4*Nb)
             snth=dsin(theta)
             snth3=snth*(1.0d0/3.0d0)
             cnth=dcos(theta)
!$acc do kernel
             DO 105 ibet=1,Nb*4         
               phi=-0.5d0*dphi+dphi*dble(ibet)
               cox=dcos(phi)*snth             
               coy=dsin(phi)*snth             
               coz=cnth                       
               do irad=1,190    !subroutine rad
                 x=r*cox
                 y=r*coy
                 z=r*coz
                 t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
                 t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
                 psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
                 dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %             -cox/(bdist*bdist))+t2*2.0d0*r
                 rnew=r-(psi-psi0)/dpsidr
                 dr=dabs(rnew-r)
                 if(dabs(dr).lt.1.0d-19)go to 1115
                 r=rnew
               enddo
1115           continue    
c              
               VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
105          CONTINUE    ! continue ibet loop
104        CONTINUE                   ! continue over ialf
!$acc end region

Here is the compile command:

pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,time -Minfo=accel

The compiler spit out these messages:

      19524, Generating compute capability 1.3 kernel
       19525, Loop is parallelizable
              Accelerator kernel generated
           19525, !$acc do parallel, vector(120)
           19562, Sum reduction generated for vol
       19536, Loop carried scalar dependence for 'r' at line 19549
              Loop carried scalar dependence for 'r' at line 19551
              Loop carried scalar dependence for 'r' at line 19552
              Loop carried scalar dependence for 'r' at line 19555
              Loop carried scalar dependence for 'r' at line 19556
              Loop carried scalar dependence for 'r' at line 19562
       19545, Loop carried scalar dependence for 'r' at line 19549
              Loop carried scalar dependence for 'r' at line 19551
              Loop carried scalar dependence for 'r' at line 19552
              Loop carried scalar dependence for 'r' at line 19555
              Loop carried scalar dependence for 'r' at line 19556
              Scalar last value needed after loop for 'r' at line 19562
              Accelerator restriction: scalar variable live-out from loop: r

I guess the compiler is relatively happy, since it made a kernel for the GPU. When I run the code, I get the following:

  setupgeo
    5382: region entered 2 times
        time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
  findradius
    19524: region entered 122 times
        time(us): total=24353792 init=52636 region=24301156
                  kernels=24291961 data=9195
        w/o init: total=24301156 max=379268 min=9607 avg=199189
        19525: kernel launched 122 times
            grid: [1]  block: [120]
            time(us): total=24289311 max=379188 min=9533 avg=199092
        19562: kernel launched 122 times
            grid: [1]  block: [256]  
            time(us): total=2650 max=105 min=15 avg=21

If I am reading this correctly, the GPU spent about 24 seconds
computing the loops. As noted above, the host CPU runs the entire
code in about 17 seconds, and that includes lots of other stuff in
addition to the 122 calls to this subroutine. The time spent moving
data in and out of memory appears to be minimal, as does the
initialization time.

As mentioned in the original post, the system is a quadcore i7 with
two Nvidia C1060 cards:

> pgaccelinfo
CUDA Driver Version            2030

Device Number:                 0
Device Name:                   Tesla C1060
Device Revision Number:        1.3
Global Memory Size:            4294705152
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1296 MHz
Initialization time:           9430 microseconds
Current free memory            4246142976
Upload time (4MB)               891 microseconds ( 714 ms pinned)
Download time                   984 microseconds ( 731 ms pinned)
Upload bandwidth               4707 MB/sec (5874 MB/sec pinned)
Download bandwidth             4262 MB/sec (5737 MB/sec pinned)

Device Number:                 1
Device Name:                   Quadro FX 380
Device Revision Number:        1.1
Global Memory Size:            268107776
Number of Multiprocessors:     2
Number of Cores:               16
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           8192
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1100 MHz
Initialization time:           9430 microseconds
Current free memory            220147712
Upload time (4MB)              1535 microseconds (1357 ms pinned)
Download time                  1481 microseconds (1273 ms pinned)
Upload bandwidth               2732 MB/sec (3090 MB/sec pinned)
Download bandwidth             2832 MB/sec (3294 MB/sec pinned)

Device Number:                 2
Device Name:                   Tesla C1060
Device Revision Number:        1.3
Global Memory Size:            4294705152
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1296 MHz
Initialization time:           9430 microseconds
Current free memory            4246142976
Upload time (4MB)              1515 microseconds (1345 ms pinned)
Download time                  1469 microseconds (1272 ms pinned)
Upload bandwidth               2768 MB/sec (3118 MB/sec pinned)
Download bandwidth             2855 MB/sec (3297 MB/sec pinned)

The two Tesla cards should have the capability to run double precision.
When I set the ACC_NOTIFY environment variable, I get messages like

launch kernel file=/home/orosz/lightcurve/./lcsubs.for function=findradius line=19525 device=0 grid=1 block=120
launch kernel file=/home/orosz/lightcurve/./lcsubs.for function=findradius line=19562 device=0 grid=1 block=256

Apparently, the system is using the Tesla card at device 0 to compute
the loops, and not the video card at device=1 that runs the
screen saver.

My bottom line question: why are these parallel loops running so slowly?

Thanks in advance,

Jerry

Hi Jerry,

Next, I’d try adjusting the schedule using the “parallel” and “vector” clauses of the “!$acc do” directive (such as “!$acc do parallel(4), vector(30)”). Though, I’m not sure how much this will help.
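Applied to the outer loop you posted, that would look something like this (the numbers are only a starting point for experimentation, since Na=120 splits evenly into 4x30):

!$acc region
!$acc do parallel(4), vector(30)
           do 104 ialf=1,Na
c            ... loop body as before ...
 104       CONTINUE
!$acc end region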

Are you able to increase the value of “Na”? I’m concerned that 120 is too small to see any significant speed-up. While there are a lot of cores on a Tesla, individually they aren’t very fast. If not, you may consider using OpenMP instead and run your code on a multi-core CPU.
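For reference, the OpenMP version of the same loop nest on the host would look something like the sketch below (untested; the private list is read off the loop you posted and should be double-checked, and you would compile with -mp):

!$omp parallel do reduction(+:vol)
!$omp& private(r,psi,dpsidr,theta,dphi,snth,snth3,cnth,ibet,phi,
!$omp&         cox,coy,coz,irad,x,y,z,t1,t2,rnew,dr)
           do 104 ialf=1,Na
c            ... same loop body as in the accelerator version ...
 104       CONTINUE
!$omp end parallel do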

  • Mat

Hi Mat,

I cranked up the values of Na and Nb, but the difference between the serial and parallel versions became even worse.

I am wondering what exactly is being made parallel in this case. Here is the current version of the code. I tried the “do parallel” directives (briefly), but they did not help, so I have them commented out here (I really should read up on exactly how they are used).


!$acc data region
!$acc region
c!$acc do parallel
           do 104 ialf=1,Na
c
             r=1.0d-12
             psi=1.0d0
             dpsidr=1.0d0
             theta=-0.5d0*dtheta+dtheta*dble(ialf)
             dphi=twopie/dble(4*Nb)
             snth=dsin(theta)
             snth3=snth*(1.0d0/3.0d0)
             cnth=dcos(theta)
c!$acc do parallel
             DO 105 ibet=1,Nb*4        
               phi=-0.5d0*dphi+dphi*dble(ibet)
               cox=dcos(phi)*snth             
               coy=dsin(phi)*snth             
               coz=cnth                       
!$acc do kernel
               do irad=1,60   !160    !subroutine rad
                 x=r*cox
                 y=r*coy
                 z=r*coz
                 t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
                 t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
                 psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
                 dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %             -cox/(bdist*bdist))+t2*2.0d0*r
                 rnew=r-(psi-psi0)/dpsidr
                 dr=dabs(rnew-r)
c                 if(dabs(dr).lt.1.0d-15)go to 1115
                 r=rnew
               enddo
1115           continue    
               VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
105          CONTINUE    

104        CONTINUE                  
!$acc end region
!$acc end data region

Here is the compiler message:

       19528, Generating compute capability 1.3 kernel
       19530, Loop is parallelizable
              Accelerator kernel generated
           19530, !$acc do parallel, vector(256)
           19562, Sum reduction generated for vol
       19541, Loop carried scalar dependence for 'r' at line 19551
              Loop carried scalar dependence for 'r' at line 19553
              Loop carried scalar dependence for 'r' at line 19554
              Loop carried scalar dependence for 'r' at line 19556
       19547, Loop carried scalar dependence for 'r' at line 19551
              Loop carried scalar dependence for 'r' at line 19553
              Loop carried scalar dependence for 'r' at line 19554
              Loop carried scalar dependence for 'r' at line 19556
              Scalar last value needed after loop for 'r' at line 19562
              Accelerator restriction: scalar variable live-out from loop: r
              Inner sequential loop scheduled on accelerator

There is still an accelerator restriction message. I am not sure if this matters with regard to the actual performance, but I suppose at some level it must, since the performance gets so bad. I have compiled and run some of the simple test codes from the PGI pages (from the four-part series by Michael Wolfe), and the code run on the GPU outperforms the same code run on the CPU.

Finally, I also commented out the statement in the innermost loop that bails out when convergence is reached. This ensures that the same number of operations are done in serial and in parallel mode. In serial mode, only a few iterations are needed once the first pixel in a given latitude row is done. So the comparison of run times should be fair, since the same number of operations are done in each case.

Jerry

Hi Jerry,

The “ibet” loop isn’t parallelizable due to ‘R’, so the compiler is scheduling it sequentially on the accelerator. It’s essentially the same as when you used the “kernel” clause before.

One thing to try is “-ta=nvidia,fastmath”. This will use less precise but much faster math intrinsics. Also, in 10.4 (due out later this week), “fastmath” will use a less precise divide. Given that your code uses many divides, this could have a large performance impact.
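For example, added to the compile line you have been using:

pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,fastmath,time -Minfo=accel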

  • Mat

Hi Mat,

I am having trouble understanding the restriction. The grid elements on the star are all independent. I took the same code and commented out the “do irad” loop start and end statements, so everything from “x=r*cox” down to “r=rnew” is done only once. The compiler tells me this:

       19528, Generating compute capability 1.3 kernel
       19530, Loop is parallelizable
              Accelerator kernel generated
           19530, !$acc do parallel, vector(256)
           19563, Sum reduction generated for vol
       19541, Loop is parallelizable

Does this mean that both of the ialf and ibet loops are now parallel? If so, then in principle I should be able to cut and paste the code inside the irad loop and hardwire the loop manually. I just tried this with the code repeated twice and with the code repeated three times and I got the same compiler message.

Rather than cut and paste 160 times, is there a way to tell the compiler to “unroll” that loop specifically, and to not worry about the scalar dependence on r? Most cases need only 4 or 5 trips through the loop, but there are cases that need a lot more.

One thing to try is “-ta=nvidia,fastmath”. This will use less precise but much faster math intrinsics. Also, in 10.4 (due out later this week), “fastmath” will use a less precise divide. Given that your code uses many divides, this could have a large performance impact.

  • Mat

The fastmath option made a marginal difference without any change in the output files. This is an option to keep in mind when tuning things. I will look forward to the updated release.

This forum has been a big help. Thanks for running it.

Thanks,

Jerry

Hi Jerry,

“R” is initialized in the outermost loop (ialf) and can change in the innermost loop (irad). Hence, the starting value of “R” for each iteration of the middle loop (ibet) will depend upon the previous iteration. This dependency prevents the ibet loop from being parallelized.

You can fix this by initializing R inside the middle loop, but your answers could be different.
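In terms of the loop you posted, only the top of the middle loop changes, something like this (a sketch; check that the answers are still acceptable):

             DO 105 ibet=1,Nb*4
c              re-initialize r for every pixel so each ibet iteration is
c              independent of the previous one
               r=1.0d-12
               phi=-0.5d0*dphi+dphi*dble(ibet)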

Does this mean that both of the ialf and ibet loops are now parallel?

It appears so.

Rather than cut and paste 160 times, is there a way to tell the compiler to “unroll” that loop specifically,

Not yet, but soon. We’re in the process of implementing an “unroll” clause for the “!$acc do” directive.

Though, trying to unroll a loop 160 times may cause other problems such as using too many registers. Instead, try moving the initialization of R inside the ibet loop and check that you’re still getting correct answers.

This forum has been a big help. Thanks for running it.

You’re welcome.

  • Mat

Hi Mat,

I think I have everything sorted out. I went ahead and cut and paste the code inside the irad loop about 100 times (it is not like I have to put a quarter in the slot for each paste). The compiler seemed happy and made both loops parallel.

The problem is that it turns out that this subroutine really is not a good candidate for parallelization. In serial mode, it is reasonably efficient. At each ialf, the initial operation count to get the radius might be a few dozen trips through the Newton-Raphson loop. However, once this is done, each iteration in the ibet loop uses only a few trips, since the radius does not change much from one pixel to the next. When both loops are parallel, the operation count in the Newton-Raphson loop can be quite large for all pixels since a global initial guess is used.

Next, I tried to make only the ibet loop parallel. At each ialf, I find the initial r and use this as the initial guess inside the ibet loop. This actually works. However, to see a time savings on the wall clock requires a very large value of Nb. For example, if Nb=960, the parallel kernel finishes in about 4.5 seconds, while in serial mode it takes 25 seconds for all of the calls to the subroutine to be completed. When Nb=3200, the times are 7.8 and 84 seconds, respectively. Finally, running it into the ground, Nb=9600 gives times of 16 seconds and 259 seconds, respectively.

In normal use, a value of Nb=48 is sufficient. The time for the parallel kernel region is 3.64 seconds (2.8 seconds of accelerator kernel time). In serial mode the subroutine takes 1.1 seconds. By playing around with the

!$acc do parallel, vector()

directive, I can get the kernel time down to about 1.5 seconds. However, the total region time always seems to be stuck at about 2.5 seconds. The initialization of the card takes 0.07 seconds, and about 0.7 seconds is used to move data between the host and the card.

For large Nb, the host CPU time is 10 or more times longer than the GPU kernel time. Why is this not the case for small Nb? Why won’t the region time go down to about 0.8 seconds (the initialization time and the data moving time)?

Thanks,

Jerry

As a postscript, here is the output the program gives after running (I compiled using -ta=nvidia,time):

For Nb=48:

Accelerator Kernel Timing data
/home/orosz/lightcurve/./lcsubs.for
  setupgeo
    5382: region entered 2 times
        time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
  findradius
    19545: region entered 15616 times
        time(us): total=3564618 init=50706 region=3513912
                  kernels=2779746 data=734166
        w/o init: total=3513912 max=9789 min=126 avg=225
        19547: kernel launched 15616 times
            grid: [1]  block: [192]
            time(us): total=2540559 max=9712 min=63 avg=162
        21715: kernel launched 15616 times
            grid: [1]  block: [256]
            time(us): total=239187 max=70 min=14 avg=15

For Nb=240:

Accelerator Kernel Timing data
/home/orosz/lightcurve/./lcsubs.for
  setupgeo
    5382: region entered 2 times
        time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
  findradius
    19545: region entered 15616 times
        time(us): total=4028584 init=47359 region=3981225
                  kernels=3265664 data=715561
        w/o init: total=3981225 max=1429 min=136 avg=254
        19547: kernel launched 15616 times
            grid: [4]  block: [256]
            time(us): total=3010606 max=1367 min=75 avg=192
        21715: kernel launched 15616 times
            grid: [1]  block: [256]
            time(us): total=255058 max=107 min=15 avg=16

Nb=480:

Accelerator Kernel Timing data
/home/orosz/lightcurve/./lcsubs.for
  setupgeo
    5382: region entered 2 times
        time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
  findradius
    19545: region entered 15616 times
        time(us): total=4311237 init=53790 region=4257447
                  kernels=3519184 data=738263
        w/o init: total=4257447 max=1768 min=138 avg=272
        19547: kernel launched 15616 times
            grid: [8]  block: [256]
            time(us): total=3264275 max=1364 min=76 avg=209
        21715: kernel launched 15616 times
            grid: [1]  block: [256]
            time(us): total=254909 max=312 min=15 avg=16

In round numbers, it appears always to take about 0.7 seconds to move data, and about 0.3 seconds to do the parallel sum. For small Nb, the actual kernel compute time seems longer than it should be, based on the performance for large Nb. There seems to be some “start-up” cost that is not accounted for in the above numbers.

Jerry

Hi Jerry,

There is some overhead in launching kernels. The exact amount per kernel call varies but is usually small (well under a millisecond). In your case, though, since the overall run time is very small and the number of kernel calls is large (15616), this overhead is having a greater impact.

Your next step is to try to reduce the number of kernel calls. Can you add the “ialf” loop back to the accelerator region?

  • Mat

Hi Mat,

When I move the ialf loop back into the region, the performance drops. Basically the kernel time is always similar to the time it takes the loop to run in serial mode. There are fewer kernel calls, but I think the operation count in the inner Newton-Raphson loop is always high.

Although this subroutine was a poor one for parallelization, I did learn a lot from trying. Other loops will require more function inlining and the conversion of 1-D arrays back into 2-D, etc, and as such will be more involved.

I don’t know how hard this is to do, but it would be very convenient to have the ability to call subroutines in the accelerator regions (I suppose this is not that easy to do, given that it is currently not a feature). If this is not an option, it would be nearly as neat if the same easy compiler directives could be used to make code parallel on a multicore CPU.

Jerry

Hi Jerry,

I don’t know how hard this is to do, but it would be very convenient to have the ability to call subroutines in the accelerator regions

Calling isn’t supported on GPUs since there isn’t a linker. Though, you can try using “-Minline” to inline your subroutines.
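For example, something along these lines (the routine names in the -Minline list are just the ones mentioned earlier in the thread; adjust to the real code):

pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,time -Minfo=accel -Minline=rad,spherepot,poten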

If this is not an option, it would be nearly as neat if the same easy compiler directives could be used to make code parallel on a multicore CPU.

While the PGI Accelerator model currently only supports NVIDIA, our next target will be x86 multi-core systems.

  • Mat

Hi Jerry,

Can you try your code again with PGI 10.4? “-ta=nvidia,fastmath” now uses a less precise but much faster divide. Another code, WRF, sees a 3x speed-up of the accelerator computation. Given the number of divides in your code, it will most likely help your code as well. Check your answers, though.

  • Mat

Hi Mat,

Thanks for the pointer. A factor of 3 is nothing to sneeze at. I am working with a mathematician on improved routines to interpolate values out of a large table and to perform numerical integration, so we will definitely pay attention to the accuracy. However, using 10.3, the fastmath option did not change the numerical results, so I am hopeful.

By the way, do you have your hands on a new Nvidia Fermi card? I have my eye on the C2050, which looks like it has just been released. I would like to see some benchmarks that compare the current C1060 cards to the new ones using PGI on FORTRAN codes, both large and small, and single and double precision.

Cheers,

Jerry

Hi Jerry,

I have my eye on the C2050, which looks like it has just been released. I would like to see some benchmarks that compare the current C1060 cards to the new ones using PGI on FORTRAN codes, both large and small, and single and double precision.

I do have a C2050 but don’t have any benchmark comparisons. Maybe someone else?

  • Mat

Hi Mat,

We have 10.4 installed. The code compiles, but I get this when trying to run:

call to EventCreate returned error 2: Out of memory

We have CUDA 10.3, although it is not clear the PGI compilers are using it. When I run pgaccelinfo, I get this:



CUDA Driver Version            2030

Device Number:                 0
Device Name:                   Tesla C1060
Device Revision Number:        1.3
Global Memory Size:            4294705152
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1296 MHz
Initialization time:           7911 microseconds
Current free memory            4246142976
Upload time (4MB)               866 microseconds ( 715 ms pinned)
Download time                   954 microseconds ( 734 ms pinned)
Upload bandwidth               4843 MB/sec (5866 MB/sec pinned)
Download bandwidth             4396 MB/sec (5714 MB/sec pinned)

Does that top line indicate a CUDA version of 2.03?

The 10.3 binary is still there, and when I use it, the code runs without an error.

Thanks,

Jerry

Hi Jerry,

Does that top line indicate a CUDA version of 2.03?

It means that your driver supports CUDA 2.3. A “3000” means CUDA 3.0.

We have CUDA 10.3, although it is not clear the PGI compilers are using it.

I’m assuming you mean CUDA 3.0. By default “-ta=nvidia” will use CUDA 2.3. If you’re adding “-ta=nvidia,cuda3.0” then you will need to update your driver.
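In other words, something like:

pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,cuda3.0,time -Minfo=accel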

What flags are you using? If nothing changed except the compiler version, then we’ve got a bug. If this is the case, can you send a report to PGI Customer Support (trs@pgroup.com) and include the code?

Thanks,
Mat

Hi Mat,

I meant CUDA 2.3 in the previous message.

We have CUDA 3.0 installed, and pgaccelinfo gives the “3000” string at the top. I no longer get the “out of memory” error. In fact, at this point, I cannot make that error come back.

Previously, before we changed CUDA, the 10.4 version of pgfortran would produce the error, and the 10.3 version would not. The CUDA version was 2.3.


That said, the -fastmath option did not seem to improve the performance. Also, the card takes 2.5 seconds to initialize, compared to about 0.2 seconds for the previous CUDA.

Jerry