# Six-loop iteration and reduction

I’m a beginner with PVF. To accelerate my large program on the GPU, I wrote an analogous but shorter code as a test. It is essentially three nested loop pairs (six loops in total); each pair is an iteration with a reduction whose result is used by the next outer loop. I rewrote it like this:

``````
      ffkk=0.0

!$acc region
do 30 ik=1,nm
do 30 iky=1,nm

ffqq=0.0
do 201 ip=1,nm
do 201 ipy=1,nm

ffqq1=0.0
do 10 iq=1,nm
do 10 iqy=1,nm
ffq=1.0/nm/nm
ffqq1=ffqq1+ffq
ffqq(ip,ipy)=ffqq1/2.0
10			continue
201		continue

ffpp=0.0
do 20 ip=1,nm
do 20 ipy=1,nm
ffp=ffqq(ip,ipy)/nm/nm
ffpp=ffpp+ffp
20		continue

ffk=ffpp/nm/nm
ffkk=ffkk+ffk
30	  continue
!$acc end region
``````

The result should be correct, but I think the parallelism is not very good, because of the many compiler messages like ‘Loop is parallelizable’ that follow:

``````prog:
31, Generating copyout(ffqq(1:20,1:20))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
32, Parallelization would require privatization of array 'ffqq(1:20,i3+1)'
33, Parallelization would require privatization of array 'ffqq(1:20,i3+1)'
Accelerator kernel generated
32, !$acc do seq
33, !$acc do seq
CC 1.0 : 11 registers; 152 shared, 20 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 20 registers; 128 shared, 48 constant, 0 local memory bytes; 16% occupancy
56, Sum reduction generated for ffkk
35, Loop is parallelizable
36, Loop is parallelizable
37, Loop is parallelizable
40, Loop carried scalar dependence for 'ffqq1' at line 43
Loop carried reuse of 'ffqq' prevents parallelization
41, Loop carried scalar dependence for 'ffqq1' at line 43
Loop carried reuse of 'ffqq' prevents parallelization
Inner sequential loop scheduled on accelerator
49, Loop is parallelizable
50, Loop is parallelizable
``````

You have some issues that need to be addressed in order for the loops to actually be parallelized on the GPU.

40, Loop carried scalar dependence for ‘ffqq1’ at line 43
Loop carried reuse of ‘ffqq’ prevents parallelization
41, Loop carried scalar dependence for ‘ffqq1’ at line 43
Loop carried reuse of ‘ffqq’ prevents parallelization
Inner sequential loop scheduled on accelerator

Hi herohzy,

Sarom is correct in that ffqq is preventing parallelization and your code is running sequentially on the device. While you know that ffqq is just a temp array used to hold intermediate values, the compiler cannot assume this. To fix it, you need to use the “private” clause to tell the compiler each thread has its own private copy of the array.

For example:

``````     ffkk=0.0

!$acc region
!$acc do parallel, vector(16)
do 30 ik=1,nm
!$acc do kernel, parallel, vector(16), private(ffqq)
do 30 iky=1,nm

ffqq=0.0
do 201 ip=1,nm
do 201 ipy=1,nm

ffqq1=0.0
do 10 iq=1,nm
do 10 iqy=1,nm
ffq=1.0/nm/nm
ffqq1=ffqq1+ffq
ffqq(ip,ipy)=ffqq1/2.0
10         continue
201      continue

ffpp=0.0
do 20 ip=1,nm
do 20 ipy=1,nm
ffp=ffqq(ip,ipy)/nm/nm
ffpp=ffpp+ffp
20      continue

ffk=ffpp/nm/nm
ffkk=ffkk+ffk
30     continue
!$acc end region
``````

Note that I also added the “kernel” clause to tell the compiler to only parallelize the “30” loops.

• Mat

Hi Mat and Sarom,

I rebuilt my program with your suggestion, but the statement

``````ffqq1=ffqq1+ffq
``````

was told

``````Vector was used where scalar was requested
``````

Then I modified my code like this:

``````     ffkk=0.0

!$acc region
!$acc do parallel, vector(16)
do 30 ik=1,nm
!$acc do kernel, parallel, vector(16), private(ffqq), private(ffq)
do 30 iky=1,nm

ffqq=0.0
do 201 ip=1,nm
do 201 ipy=1,nm

ffq=0.0
do 10 iq=1,nm
do 10 iqy=1,nm
ffq(iq,iqy)=1.0/nm/nm
!ffqq1=ffqq1+ffq
!ffqq(ip,ipy)=ffqq1/2.0
10       continue
ffqq(ip,ipy)=sum(ffq)/2.0

201      continue

ffpp=0.0
do 20 ip=1,nm
do 20 ipy=1,nm
ffp=ffqq(ip,ipy)/nm/nm
ffpp=ffpp+ffp
20      continue

ffk=ffpp/nm/nm
ffkk=ffkk+ffk
30     continue
!$acc end region
``````

But still it doesn’t work very well

``````prog:
31, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
33, Loop is parallelizable
35, Loop is parallelizable
Accelerator kernel generated
33, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
35, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
CC 1.0 : 32 registers; 1120 shared, 48 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 32 registers; 1032 shared, 108 constant, 0 local memory bytes; 66% occupancy
61, Sum reduction generated for ffkk
37, Loop is parallelizable
38, Loop carried reuse of 'ffq' prevents parallelization
39, Loop carried reuse of 'ffq' prevents parallelization
41, Loop is parallelizable
42, Loop is parallelizable
43, Loop is parallelizable
48, Loop is parallelizable
54, Loop is parallelizable
55, Loop is parallelizable
``````

I’m confused by the parallel, vector, private, local, and so on in the loops. I wonder whether the reason the compiler fails to parallelize the code is that there are so many loops.
Thanks a lot for any more help!

Vector was used where scalar was requested

This looks like a -Minfo message from the host code generation, warning that vector instructions were used in a scalar context. It’s not relevant for your GPU version. To see only the accelerator compiler feedback messages, use “-Minfo=accel”.

But still it doesn’t work very well

Doesn’t work well how? Performance, validation?

Your -Minfo messages look fine. The outer two loops are being parallelized and your schedule (parallel, vector) is good. The occupancy is good at 66%; not great, but good. Finally, the compiler is recognising the sum reduction. Note that you can ignore the “loop carried reuse” messages, since these just prevent the middle loops from parallelizing.

If you don’t think you are getting good performance, what is the profile information (-ta=nvidia,time) telling you? Do you not have enough parallelism (see the number of grids and blocks)? Is data movement causing the issue? Is your device initialisation time high?

I’m confused by the parallel, vector, private, local, and so on in the loops. I wonder whether the reason the compiler fails to parallelize the code is that there are so many loops.

Take a look at Michael Wolfe’s Understanding the CUDA Data Parallel Threading Model: A Primer. It’s very helpful for understanding what’s going on on the device. “parallel” corresponds to a CUDA block, and “vector” to a CUDA thread. So in your case:

33, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
35, !$acc do parallel, vector(16) ! blockidx%x threadidx%x

has the compiler creating a two-dimensional grid of two-dimensional 16x16 thread blocks. The number of blocks will be set dynamically at runtime depending upon the loop bounds.
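For intuition, the reported schedule corresponds roughly to the following CUDA Fortran grid/block geometry (a hypothetical sketch, not compiler output; it assumes nm=20 from the test program and only computes the launch shape, without an actual kernel):

```fortran
      program launch_shape
      ! Sketch: the grid/block geometry implied by
      ! "parallel, vector(16)" on each of the two outer loops.
      use cudafor
      implicit none
      integer, parameter :: nm = 20   ! loop trip count from the test code
      type(dim3) :: grid, tblock

      ! vector(16) on each loop -> 16x16 threads per block
      tblock = dim3(16, 16, 1)
      ! "parallel" -> enough blocks in each dimension to cover nm iterations
      grid = dim3((nm + 15) / 16, (nm + 15) / 16, 1)

      print *, 'blocks:', grid%x, 'x', grid%y, &
               ' threads/block:', tblock%x, 'x', tblock%y
      end program launch_shape
```

With nm=20 this gives a 2x2 grid of 16x16-thread blocks, i.e. 1024 threads for 400 iterations; my understanding is that the generated kernel guards the excess threads with a bounds test.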

“private” means that each thread will get its own private copy of the variable. By default all scalars are private, while arrays are assumed to be shared by all threads. In your case, “ffq” is being used to hold intermediate values from the inner loop. If it were shared, all your threads would overwrite each other. Note that you could manually privatise it by adding two extra dimensions for the ik and iky loops. Note also that private data is destroyed at the end of the compute region.

“local” means to allocate space for the variable but not perform any data movement. Unlike “private”, the data is shared by all threads. So if you were to manually privatise “ffq”, then you would want to make it “local” to avoid copying it back to the host.
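As a sketch of what that manual privatization might look like (hypothetical, adapted from the test code above; the two extra dimensions of ffq are indexed by the parallel loop variables, and “local” keeps the scratch array on the device):

```fortran
      program manual_private
      implicit none
      integer :: ik, iky, iq, iqy
      integer, parameter :: nm = 20
      ! two extra dimensions, indexed by the parallel loop variables,
      ! give each (ik,iky) iteration its own slice of scratch space
      real, allocatable :: ffq(:,:,:,:)
      real :: ffkk

      allocate(ffq(nm,nm,nm,nm))
      ffkk = 0.0

!$acc region local(ffq)
      do 30 ik=1,nm
      do 30 iky=1,nm
      do 10 iq=1,nm
      do 10 iqy=1,nm
      ! no sharing: each thread writes only its own slice
      ffq(iq,iqy,ik,iky) = 1.0/nm/nm
10    continue
      ffkk = ffkk + sum(ffq(:,:,ik,iky))
30    continue
!$acc end region

      write(*,*) 'ffkk', ffkk
      end program manual_private
```

The trade-off is memory: the scratch array grows by a factor of nm*nm, which is exactly the cost discussed further down in this thread.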

I wonder whether the reason the compiler fails to parallelize the code is that there are so many loops.

The PGI Accelerator Model currently only supports tightly nested loops, so the dependency on “ffq” will prevent the middle two loops from parallelizing. One of the things we are looking at is how to extend the Model to accommodate non-tightly nested loops but this is still in its early stages.

• Mat

If you don’t think you are getting good performance, what is the profile information (-ta=nvidia,time) telling you? Do you not have enough parallelizism (see the number of grids and blocks)? Is data movement causing the issue? Is your device initialisation time high?

I just set nm=5, then ran it and got this message:

``````call to cuMemAlloc returned error 2: Out of memory
CUDA driver version:4020

31:region entered 1 time
time(us): init=0
``````

The GPU I use is a GeForce GTS 450. Is its memory not enough for my computation?

The GPU I use is a GeForce GTS 450. Is its memory not enough for my computation?

It could be your device, it could be the program; I’m not sure. I would need a reproducing example to tell what’s wrong. Can you post one, or send an example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me?

• Mat

Hi Mat,
Here is the complete code of my test program,

``````      program prog
use accel_lib

implicit none
! Variables
integer::ik,iky,ip,ipy,iq,iqy
integer::nm

real,allocatable::ffq(:,:),ffqq(:,:)
real::ffp,ffpp
real::ffk,ffkk

real::ffqq1,ffq1
real::ffpc,ffppc
real::ffkc,ffkkc

integer::c0,c1,c2
! Body
nm=20
allocate(ffqq(nm,nm))
!call acc_init(acc_device_nvidia)

call system_clock(count=c0)

ffkk=0.0

!$acc region
!$acc do parallel, vector(16)
do 30 ik=1,nm
!$acc do kernel, parallel, vector(16), private(ffqq), private(ffq)
do 30 iky=1,nm

ffqq=0.0
do 201 ip=1,nm
do 201 ipy=1,nm

ffq=0.0
do 10 iq=1,nm
do 10 iqy=1,nm
ffq(iq,iqy)=1.0/nm/nm
!ffqq1=ffqq1+ffq
!ffqq(ip,ipy)=ffqq1/2.0
10       continue
ffqq(ip,ipy)=sum(ffq)

201      continue

ffpp=0.0
do 20 ip=1,nm
do 20 ipy=1,nm
ffp=ffqq(ip,ipy)/nm/nm
ffpp=ffpp+ffp
20      continue

ffk=ffpp/nm/nm
ffkk=ffkk+ffk
30     continue
!$acc end region

call system_clock(count=c1)
ffkk=ffkk
write(*,*)'ffkk',ffkk

!Check on the CPU
ffkkc=0.0
do ik=1,nm*nm

ffpp=0.0
do ip=1,nm*nm

ffqq1=0.0
do iq=1,nm*nm
ffq1=1.0/nm/nm
ffqq1=ffqq1+ffq1
enddo

ffpc=ffqq1/nm/nm
ffppc=ffppc+ffpc
enddo

ffkc=ffppc/nm/nm
ffkkc=ffkkc+ffkc
enddo
call system_clock(count=c2)

write(*,*)'ffkkc',ffkkc

write(*,*)c1-c0,'ms on GPU'
write(*,*)c2-c1,'ms on host'
end program prog
``````

I have tried many times, but the same error from my last post still turns up.

Hi, Mat
I have sent the source code of the program that I really want to accelerate to PGI Customer Service.
I’d really appreciate it if you could spend some of your time checking it!

Thanks,
herohzy.

Hi herohzy,

The problem with the above code is that you forgot to allocate ffq. Hence, when the compiler goes to allocate ffq on the device, it’s using a bogus size. When using accelerator directives, be sure your program is correct on the host; otherwise you may waste a lot of time chasing down what appear to be odd GPU issues but are really basic issues with your code.

I’ll get your full program from customer support to see if it’s the same problem there.

• Mat
``````% diff test_org.f90 test.f90
22c22
<
---
>      allocate(ffq(nm,nm))
% pgf90 test.f90 -ta=nvidia -Minfo=accel
prog:
29, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
31, Loop is parallelizable
33, Loop is parallelizable
Accelerator kernel generated
31, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
33, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
CC 1.0 : 28 registers; 1112 shared, 44 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 59 registers; 1040 shared, 100 constant, 0 local memory bytes; 33% occupancy
59, Sum reduction generated for ffkk
35, Loop is parallelizable
36, Loop carried reuse of 'ffq' prevents parallelization
37, Loop carried reuse of 'ffq' prevents parallelization
39, Loop is parallelizable
40, Loop is parallelizable
41, Loop is parallelizable
46, Loop is parallelizable
52, Loop is parallelizable
53, Loop is parallelizable
% a.out
ffkk   0.9999990
ffkkc    200.6242
5475039 ms on GPU
63251 ms on host
``````

Hi herohzy,

I took a look at your code but there weren’t any accelerator directives in it so I was not able to reproduce your error. In looking at the loop that you want to accelerate, I see a number of issues.

First, unlike the sample code, your loops are triangular, i.e. the inner loop bounds are determined by the outer loop index. The GPU uses a rectangular grid to launch kernels, so the compiler will have to translate your code to be rectangular and then put in an if statement to ignore the upper triangle. Something like:

``````      do 20 ikx=0,n2m
!       do 20 iky=0,ikx
do 20 iky=0,n2m
if (iky .le. ikx) then
c---------------------
ffqq1=0.d0
...
``````

Also, “n2m” is fairly small, 40, so you don’t have a lot of parallelism. Ideally, you want loop sizes in the thousands.

While I’ve only looked at the code for a few minutes, my initial assessment is that it’s better left as an OpenMP code rather than accelerating it. Not that it can’t be put on a GPU, but in its current form only the outermost loop is parallelizable, so you don’t have enough parallelism to see a speed-up.

Having said that, it may be possible to rearrange the algorithm to make it better suited for a GPU. Though I will leave that to you.

• Mat

Hello, Mat,
Thanks a lot for your kind help! As in the several previous posts, the code runs smoothly after allocating ffq. But there’s still a problem: when I set nm=50 or much greater, it just runs into errors. Is this related to memory?

Moreover, I have sent you, again through PGI Customer Service, some programs, including the acc program sent last time and a CUDA Fortran program I wrote these days. The problems are described in the e-mail.

Hi herohzy,

When I increased nm to 50, the code ran without error on my system, albeit very slowly. Though, my device has 6GB of memory, so yes, it does appear that you simply don’t have enough memory to run this program. How much memory does your device have? (See the output from the ‘pgaccelinfo’ utility.)

``````
c$acc region
c$acc do parallel,vector(16)
do 19 ikx=0,n2m
c$acc do parallel,vector(16),private(ffpp),private(ffp)
do 19 iky=0,n2m
``````

One thing to keep in mind: when you privatize an array, every thread gets its own copy. So, in the above code, the two private arrays are each 401x401. At a minimum there will be n2m times n2m threads (most likely more, since there are 16x16 threads per block). So when nm=40, n2m=20. This means that 512 threads will be created (2 blocks of 16x16). When nm=50, 768 threads are created (3 blocks).

How much memory do each of the two private arrays require?

512 threads * 401 * 401 * 8 * 2 = 1.2 GB
768 threads * 401 * 401 * 8 * 2 = 1.8 GB

I’m guessing you only have 2GB of memory?

You can only fix this by reducing the number of threads or decreasing the size of the private arrays. However, your code uses only a few threads as it is, so reducing the number of threads will only hurt your performance even more.

It looks to me that the size of ffpp and ffp is just some fixed maximum value and only an n2m by n2m section is actually used. Can you update your code to dynamically allocate your arrays to use just the necessary amount of memory?
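For instance (a hypothetical sketch; the actual declarations are in herohzy’s code, which isn’t shown in this thread), the fixed-size declarations could become allocations sized to what is used:

```fortran
! Sketch: instead of fixed-size maximum declarations such as
!     double precision ffpp(0:400,0:400), ffp(0:400,0:400)
! allocate only the section actually used, cutting the
! per-thread private memory on the device:
      double precision, allocatable :: ffpp(:,:), ffp(:,:)
      allocate(ffpp(0:n2m,0:n2m), ffp(0:n2m,0:n2m))
```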

Here’s another section of code with issues.

``````
c$acc region
!c$acc do private(ffkk,ffk)
do 20 ikx=0,n2m
do 20 iky=0,ikx
c---------------------------------
if(wk(ikx,iky).gt.0.d0)  then

... cut

c--------------------------
do 75 i=1,13
ffkk(i)=ffkk(i)+ffk(i)*aa
75    continue
c--------------------------
20    continue
c$acc end region

do 80 i=1,13
ffkk(i)=2.d0*ffkk(i)*4.d0/nm2
80    continue
``````

Now this code won’t accelerate either, due to the loop dependency on ffkk and ffk. It looks like you thought about privatizing them but commented it out. Here, you don’t want to privatize ffkk, since its values are used outside of the compute region and the contents of private arrays are destroyed at the end of the compute region.

What you’d really like here is for the compiler to recognise the array reductions. Unfortunately, the OpenACC standard and the PGI Accelerator Model only support scalar reductions. So one fix might be to change ffkk to be 13 scalars instead of an array.
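A sketch of that rewrite (hypothetical, reduced to two of the thirteen elements; the constants 1.d0 and 2.d0 stand in for the ffk(i) values computed inside the real loop):

```fortran
      program scalar_reductions
      implicit none
      integer :: ikx, iky
      integer, parameter :: n2m = 20
      double precision :: ffkk(13), ffkk1, ffkk2, aa

      aa = 0.5d0
      ! one scalar accumulator per array element; the compiler can
      ! generate a sum reduction for scalars but not for arrays
      ffkk1 = 0.d0
      ffkk2 = 0.d0
!$acc region
      do 20 ikx=0,n2m
      do 20 iky=0,n2m
      ffkk1 = ffkk1 + 1.d0*aa   ! stands in for ffk(1)*aa
      ffkk2 = ffkk2 + 2.d0*aa   ! stands in for ffk(2)*aa
20    continue
!$acc end region
      ! gather the scalar results back into the array after the region
      ffkk(1) = ffkk1
      ffkk(2) = ffkk2
      write(*,*) ffkk(1), ffkk(2)
      end program scalar_reductions
```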

I’m also thinking about your performance. The directives are designed to work on tightly nested loops. So when you have the loop structure of:

``````      do 19 ikx=0,n2m
do 19 iky=0,n2m

do 30 iqx=0,n2m
do 30 iqy=0,iqx

do 40 ipx=0,n2m
do 40 ipy=0,ipx

40 continue
30 continue

do 31 iqx=0,n2m
do 31 iqy=0,iqx

31      continue

... scalar code

19 continue
``````

Only the outer two loops can be accelerated. Since n2m is small, you won’t get much parallelism, and therefore performance will be poor. However, if you can rearrange your code to something like:

``````      do 19 ikx=0,n2m
do 19 iky=0,n2m
do 30 iqx=0,n2m
do 30 iqy=0,iqx
do 40 ipx=0,n2m
do 40 ipy=0,ipx
40 continue
30 continue
19 continue

do 29 ikx=0,n2m
do 29 iky=0,n2m
do 31 iqx=0,n2m
do 31 iqy=0,iqx
31 continue
29 continue

do 39 ikx=0,n2m
do 39 iky=0,n2m

... scalar code

39 continue
``````

Of course, this means adding extra dimensions to your arrays so that dependent information can be passed between the sets of loops. But if it works, then you’ll have three arrays (a 6-D, a 4-D, and a 2-D) that have a lot more exposed parallelism. Granted, I haven’t modified your code to see if it will actually work, but it might be worth trying.

I’ll take a look at your CUDA Fortran code next.

• Mat

Hi herohzy,

For your first issue, you said that when you access several of your argument arrays from a kernel, the program errors. For example: “wkq=wk(iqx,iqy)”. One thing that would help is to compile the code with “-Mcuda=emu -g” and run it through the PGI debugger, pgdbg. What is the value of iqx? For the first thread it’s zero, and your thread is accessing memory beyond the bounds of the arrays, causing the kernel to abort.

For the second error you said you had a problem if you uncommented an if statement. I’m assuming you mean the following:

``````        if(vl1k.GT.0.0)then
gaprk1=-vl2k/sqrt(vl1k) !2*gaprk
else
gaprk1=0.0
endif
gapr=gapr+gaprk1*cc
``````

Now gapr is a scalar dummy argument. I’m assuming that you mean for this to perform a sum reduction across all threads? Unfortunately, this won’t work. Instead you need to add in a parallel sum reduction. I show an example of this in my article on multi-GPU programming: http://www.pgroup.com/lit/articles/insider/v3n3a2.htm.

• Mat

Hi, Mat,
Yes, for my CUDA program, when I compiled the code with “-Mcuda=emu -g”, it ran normally, including accessing the arguments wk(:,:), bw(:,:), snb(:,:), etc., and even gave a good-looking (though not yet correct) result. But how can I get it to run normally on the GPU? Do I need to update my GPU? (Now I use a GTS 450.)
Thanks a million.

Hi herohzy,

When running on the CPU (i.e. in emulation), “off-by-one” and “out-of-bounds” errors generally won’t cause your program to fail. However, on the GPU they will. So if it works on the CPU and fails on the GPU, this is typically the cause.

Do I need to update my GPU?

That’s one possibility, but I would first try to reorganise the algorithm so it doesn’t need as much memory.

• Mat