Six-loop iteration and reduction

I'm a beginner with PVF. To accelerate my large program on the GPU, I wrote an analogous but shorter code as a test. It is just a three-level (six-loop) nest; each level is an iteration with a reduction whose result is used by the next outer loop. I rewrote it like this:

	  ffkk=0.0

	  !$acc region
	  do 30 ik=1,nm
	  do 30 iky=1,nm
		
		ffqq=0.0
		do 201 ip=1,nm
		do 201 ipy=1,nm
			
			ffqq1=0.0
			do 10 iq=1,nm
			do 10 iqy=1,nm
				ffq=1.0/nm/nm
				ffqq1=ffqq1+ffq
				ffqq(ip,ipy)=ffqq1/2.0
10			continue
201		continue

		ffpp=0.0
		do 20 ip=1,nm
		do 20 ipy=1,nm
			ffp=ffqq(ip,ipy)/nm/nm
			ffpp=ffpp+ffp
20		continue

		ffk=ffpp/nm/nm
		ffkk=ffkk+ffk
30	  continue
	  !$acc end region
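For reference, the value this nest should produce can be checked on the host: each innermost double loop sums nm*nm copies of 1/nm², halved to 0.5, and every outer level averages that back over nm*nm iterations, so ffkk should come out to 0.5 for any nm. A quick sketch of that check in Python (not part of the program, just verifying the arithmetic):

```python
# Host-side check of the three-level reduction above.
# Every level averages over an nm x nm grid, so the result is nm-independent.
def host_check(nm):
    ffkk = 0.0
    for _ in range(nm * nm):                                  # ik, iky loops
        # innermost loops: nm*nm terms of 1/nm^2, halved -> ffqq = 0.5
        ffqq = sum(1.0 / nm / nm for _ in range(nm * nm)) / 2.0
        # middle loops: nm*nm terms of ffqq/nm^2 -> ffpp = 0.5
        ffpp = sum(ffqq / nm / nm for _ in range(nm * nm))
        ffkk += ffpp / nm / nm                                # outer reduction
    return ffkk

print(host_check(5))   # -> 0.5 (up to floating-point rounding)
```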

The result should be correct, but I think the parallelism is not very good, because of the many messages in the compiler feedback, as follows.

prog:
     31, Generating copyout(ffqq(1:20,1:20))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     32, Parallelization would require privatization of array 'ffqq(1:20,i3+1)'
     33, Parallelization would require privatization of array 'ffqq(1:20,i3+1)'
         Accelerator kernel generated
         32, !$acc do seq
         33, !$acc do seq
             CC 1.0 : 11 registers; 152 shared, 20 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 20 registers; 128 shared, 48 constant, 0 local memory bytes; 16% occupancy
         56, Sum reduction generated for ffkk
     35, Loop is parallelizable
     36, Loop is parallelizable
     37, Loop is parallelizable
     40, Loop carried scalar dependence for 'ffqq1' at line 43
         Loop carried reuse of 'ffqq' prevents parallelization
     41, Loop carried scalar dependence for 'ffqq1' at line 43
         Loop carried reuse of 'ffqq' prevents parallelization
         Inner sequential loop scheduled on accelerator
     49, Loop is parallelizable
     50, Loop is parallelizable

I am looking forward to your helpful directives.

You have some issues that need to be addressed in order for the loops to actually be parallelized on the GPU.

40, Loop carried scalar dependence for ‘ffqq1’ at line 43
Loop carried reuse of ‘ffqq’ prevents parallelization
41, Loop carried scalar dependence for ‘ffqq1’ at line 43
Loop carried reuse of ‘ffqq’ prevents parallelization
Inner sequential loop scheduled on accelerator

Hi herohzy,

Sarom is correct in that ffqq is preventing parallelization, and your code is running sequentially on the device. While you know that ffqq is just a temp array used to hold intermediate values, the compiler cannot assume this. To fix it, you need to use the "private" clause to tell the compiler that each thread has its own private copy of the array.

For example:

     ffkk=0.0

     !$acc region
!$acc do parallel, vector(16)
     do 30 ik=1,nm
!$acc do kernel, parallel, vector(16), private(ffqq)
     do 30 iky=1,nm
      
      ffqq=0.0
      do 201 ip=1,nm
      do 201 ipy=1,nm
         
         ffqq1=0.0
         do 10 iq=1,nm
         do 10 iqy=1,nm
            ffq=1.0/nm/nm
            ffqq1=ffqq1+ffq
            ffqq(ip,ipy)=ffqq1/2.0
10         continue
201      continue

      ffpp=0.0
      do 20 ip=1,nm
      do 20 ipy=1,nm
         ffp=ffqq(ip,ipy)/nm/nm
         ffpp=ffpp+ffp
20      continue

      ffk=ffpp/nm/nm
      ffkk=ffkk+ffk
30     continue
     !$acc end region

Note that I also added the “kernel” clause to tell the compiler to only parallelize the “30” loops.

  • Mat

Hi Mat and Sarom,
Many thanks for your helpful reminders, and for Mat's specific modification.

And I rebuilt my program with your suggestion, but the statement

ffqq1=ffqq1+ffq

produced the warning

Vector was used where scalar was requested

Then I modify my code like this,

     ffkk=0.0 

     !$acc region 
!$acc do parallel, vector(16) 
     do 30 ik=1,nm 
!$acc do kernel, parallel, vector(16),private(ffqq),private(ffq)
     do 30 iky=1,nm 
      
      ffqq=0.0 
      do 201 ip=1,nm 
      do 201 ipy=1,nm 
          
         ffq=0.0 
         do 10 iq=1,nm 
         do 10 iqy=1,nm 
            ffq(iq,iqy)=1.0/nm/nm 
            !ffqq1=ffqq1+ffq
            !ffqq(ip,ipy)=ffqq1/2.0 
10       continue
         ffqq(ip,ipy)=sum(ffq)/2.0


201      continue 

      ffpp=0.0 
      do 20 ip=1,nm 
      do 20 ipy=1,nm 
         ffp=ffqq(ip,ipy)/nm/nm 
         ffpp=ffpp+ffp 
20      continue 

      ffk=ffpp/nm/nm 
      ffkk=ffkk+ffk 
30     continue 
     !$acc end region

But still it doesn't work very well:

prog:
     31, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     33, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         33, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         35, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.0 : 32 registers; 1120 shared, 48 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 32 registers; 1032 shared, 108 constant, 0 local memory bytes; 66% occupancy
         61, Sum reduction generated for ffkk
     37, Loop is parallelizable
     38, Loop carried reuse of 'ffq' prevents parallelization
     39, Loop carried reuse of 'ffq' prevents parallelization
     41, Loop is parallelizable
     42, Loop is parallelizable
     43, Loop is parallelizable
     48, Loop is parallelizable
     54, Loop is parallelizable
     55, Loop is parallelizable

I'm confused by "parallel", "vector", "private", "local", and so on in the loops. I wonder whether the reason is that there are so many loops that the compiler fails to parallelize the code.
Thanks a lot for any more help!

Vector was used where scalar was requested

This looks like a -Mneginfo message from the host code generation, warning that vector instructions were used in a scalar context. It's not relevant for your GPU version. To see only the accelerator compiler feedback messages, use "-Minfo=accel".

But still it doesn’t work very well

Doesn't work well how? Performance or validation?

Your -Minfo messages look fine. The outer two loops are being parallelized, and your schedule (parallel, vector) is good. The occupancy is good at 66%; not great, but good. Finally, the compiler is recognizing the sum reduction. Note that you can ignore the "loop carried reuse" messages, since they just prevent the middle loops from parallelizing.

If you don't think you are getting good performance, what is the profile information (-ta=nvidia,time) telling you? Do you not have enough parallelism (see the number of grids and blocks)? Is data movement causing the issue? Is your device initialization time high?

I'm confused by "parallel", "vector", "private", "local", and so on in the loops. I wonder whether the reason is that there are so many loops that the compiler fails to parallelize the code.

Take a look at Michael Wolfe's article Understanding the CUDA Data Parallel Threading Model: A Primer. It is very helpful for understanding what's going on on the device. "parallel" corresponds to a CUDA block, and "vector" to a CUDA thread. So in your case:

33, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
35, !$acc do parallel, vector(16) ! blockidx%x threadidx%x

this has the compiler creating a two-dimensional grid of two-dimensional 16x16 thread blocks. The number of blocks is set dynamically at runtime depending upon the loop bounds.
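As a concrete check of those numbers, the launch shape can be computed from the loop bounds and the vector length; a small Python sketch (the helper name is mine, and the 16 comes from the vector(16) clauses):

```python
import math

# Each "vector(16)" clause gives 16 threads per block in that dimension;
# the compiler rounds the trip count up to a whole number of blocks.
def launch_shape(nm, vector_len=16):
    blocks_per_dim = math.ceil(nm / vector_len)
    return (blocks_per_dim, blocks_per_dim), (vector_len, vector_len)

grid, block = launch_shape(20)   # nm = 20 as in the test program
print(grid, block)               # -> (2, 2) (16, 16): a 2x2 grid of 16x16 blocks
```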

"private" means that each thread gets its own private copy of the variable. By default all scalars are private, while arrays are assumed to be shared by all threads. In your case, "ffq" is being used to hold intermediate values from the inner loop. If it were shared, all your threads would overwrite each other. Note that you could manually privatize it by adding two extra dimensions for the ik and iky loops. Also note that private data is destroyed at the end of the compute region.

"local" means to allocate space for the variable but not do any data movement. Unlike "private", the data is shared by all threads. So if you were to manually privatize "ffq", you would want to make it "local" to avoid copying it back to the host.

I wonder whether the reason is that there are so many loops that the compiler fails to parallelize the code.

The PGI Accelerator Model currently only supports tightly nested loops, so the dependency on "ffq" will prevent the middle two loops from parallelizing. One of the things we are looking at is how to extend the Model to accommodate non-tightly nested loops, but this is still in its early stages.

  • Mat

If you don't think you are getting good performance, what is the profile information (-ta=nvidia,time) telling you? Do you not have enough parallelism (see the number of grids and blocks)? Is data movement causing the issue? Is your device initialization time high?

I just set nm=5, then ran it and got this message:

call to cuMemAlloc returned error 2: Out of memory
CUDA driver version:4020

    31:region entered 1 time
         time(us): init=0

The GPU I use is a GeForce GTS 450; is its memory not enough for my computation?

The GPU I use is a GeForce GTS 450; is its memory not enough for my computation?

It could be your device, it could be the program; I'm not sure. I would need a reproducing example to tell what's wrong. Can you post one, or send an example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me?

  • Mat

Hi Mat,
Thanks for your patient help!
Here is the complete code of my test program,

      program prog
	  use accel_lib

      implicit none
      ! Variables
	  integer::ik,iky,ip,ipy,iq,iqy
	  integer::nm

	  real,allocatable::ffq(:,:),ffqq(:,:)
	  real::ffp,ffpp
	  real::ffk,ffkk

	  real::ffqq1,ffq1
	  real::ffpc,ffppc
	  real::ffkc,ffkkc

	  integer::c0,c1,c2
      ! Body
	  nm=20
	  allocate(ffqq(nm,nm))
	  !call acc_init(acc_device_nvidia) 

	  call system_clock(count=c0)

     ffkk=0.0 

     !$acc region 
!$acc do parallel, vector(16) 
     do 30 ik=1,nm 
!$acc do kernel, parallel, vector(16),private(ffqq),private(ffq)
     do 30 iky=1,nm 
      
      ffqq=0.0 
      do 201 ip=1,nm 
      do 201 ipy=1,nm 
          
         ffq=0.0 
         do 10 iq=1,nm 
         do 10 iqy=1,nm 
            ffq(iq,iqy)=1.0/nm/nm 
            !ffqq1=ffqq1+ffq
            !ffqq(ip,ipy)=ffqq1/2.0 
10       continue
         ffqq(ip,ipy)=sum(ffq)


201      continue 

      ffpp=0.0 
      do 20 ip=1,nm 
      do 20 ipy=1,nm 
         ffp=ffqq(ip,ipy)/nm/nm 
         ffpp=ffpp+ffp 
20      continue 

      ffk=ffpp/nm/nm 
      ffkk=ffkk+ffk 
30     continue 
     !$acc end region 

	  call system_clock(count=c1)
	  ffkk=ffkk
	  write(*,*)'ffkk',ffkk
	
	!Check on the CPU	
	  ffkkc=0.0
	  do ik=1,nm*nm

		ffpp=0.0
		do ip=1,nm*nm
			
			ffqq1=0.0
			do iq=1,nm*nm
				ffq1=1.0/nm/nm
				ffqq1=ffqq1+ffq1
			enddo

			ffpc=ffqq1/nm/nm
			ffppc=ffppc+ffpc
		enddo

		ffkc=ffppc/nm/nm
		ffkkc=ffkkc+ffkc
	  enddo
	  call system_clock(count=c2)

	  write(*,*)'ffkkc',ffkkc

	  write(*,*)c1-c0,'ms on GPU'
	  write(*,*)c2-c1,'ms on host'
      end program prog

I have tried many times, but the same error from my last post still turns up.

Hi, Mat
I have sent the source code of the program that I really want to accelerate to PGI Customer Service.
I'd really appreciate it if you could spend some time checking it!

Thanks,
herohzy.

Hi herohzy,

The problem with the above code is that you forgot to allocate ffq. Hence, when the compiler goes to allocate ffq on the device, it's using a bogus size. When using accelerator directives, be sure your program is correct on the host first; otherwise you may waste a lot of time chasing down what appear to be odd GPU issues but are really basic problems in your code.

I'll get your full program from customer support to see if it's the same problem there.

  • Mat
% diff test_org.f90 test.f90
22c22
<
---
>      allocate(ffq(nm,nm))
% pgf90 test.f90 -ta=nvidia -Minfo=accel
prog:
     29, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     31, Loop is parallelizable
     33, Loop is parallelizable
         Accelerator kernel generated
         31, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         33, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.0 : 28 registers; 1112 shared, 44 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 59 registers; 1040 shared, 100 constant, 0 local memory bytes; 33% occupancy
         59, Sum reduction generated for ffkk
     35, Loop is parallelizable
     36, Loop carried reuse of 'ffq' prevents parallelization
     37, Loop carried reuse of 'ffq' prevents parallelization
     39, Loop is parallelizable
     40, Loop is parallelizable
     41, Loop is parallelizable
     46, Loop is parallelizable
     52, Loop is parallelizable
     53, Loop is parallelizable
% a.out
 ffkk   0.9999990
 ffkkc    200.6242
      5475039 ms on GPU
        63251 ms on host

Hi herohzy,

I took a look at your code, but there weren't any accelerator directives in it, so I was not able to reproduce your error. Looking at the loop that you want to accelerate, I see a number of issues.

First, unlike the sample code, your loops are triangular, i.e. the inner loop bounds are determined from the outer loop index. The GPU uses a rectangular grid to launch kernels, so the compiler has to translate your code to be rectangular and then insert an if statement to ignore the upper triangle. Something like:

      do 20 ikx=0,n2m
 !       do 20 iky=0,ikx
        do 20 iky=0,n2m
       if (iky .le. ikx) then
c---------------------
      ffqq1=0.d0
...
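The same rectangularization, sketched in Python just to show that the guarded rectangular nest visits exactly the same index set as the triangular one:

```python
# The triangular nest iky = 0..ikx versus a rectangular nest with a guard.
def triangular(n2m):
    return [(ikx, iky) for ikx in range(n2m + 1) for iky in range(ikx + 1)]

def rectangular_guarded(n2m):
    return [(ikx, iky)
            for ikx in range(n2m + 1)
            for iky in range(n2m + 1)
            if iky <= ikx]          # the inserted "if" skips the upper triangle

print(triangular(3) == rectangular_guarded(3))   # -> True
```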

Also, “n2m” is fairly small, 40, so you don’t have a lot of parallelism. Ideally, you want loop sizes in the thousands.

While I've only looked at the code for a few minutes, my initial assessment is that it's better left as an OpenMP code rather than accelerating it. Not that it can't be put on a GPU, but in its current form only the outermost loop is parallelizable, so you don't have enough parallelism to see a speed-up.

Having said that, it may be possible to rearrange the algorithm to make it better suited for a GPU. Though I will leave that to you.

  • Mat

Hello, Mat,
Thanks a lot for your kind help! As in the several previous posts, the code runs smoothly after allocating ffq. But there is still a problem: when I set nm=50 or much greater, it runs into errors. Is this related to memory?

Moreover, I have sent you, again through PGI Customer Service, some programs, including the acc program sent last time and a CUDA Fortran program I wrote these days. The problems are described in the e-mail.

Hi herohzy,

When I increased nm to 50, the code ran without error on my system, albeit very slowly. My device has 6GB of memory, though, so yes, it does appear that you simply don't have enough memory to run this program. How much memory does your device have? (See the output from the 'pgaccelinfo' utility.)

c$acc region 
c$acc do parallel,vector(16)
      do 19 ikx=0,n2m
c$acc do parallel,vector(16),private(ffpp),private(ffp)
        do 19 iky=0,n2m

One thing to keep in mind: when you privatize an array, every thread gets its own copy. So, in the above code, the two private arrays are each 401x401. At a minimum there will be n2m times n2m threads (most likely more, since there are 16x16 threads in a block). So when nm=40, n2m=20, and 512 threads are created (2 blocks of 16x16). When nm=50, 768 threads are created (3 blocks).

How much memory do the two private arrays require?

512 threads * 401 * 401 * 8 * 2 = 1.2 GB
768 threads * 401 * 401 * 8 * 2 = 1.8 GB
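That arithmetic can be double-checked directly; a quick Python sketch (assuming two 401x401 arrays of 8-byte reals per thread, as above):

```python
# Per-thread private storage: two 401x401 double-precision arrays.
def private_mem_gb(threads, n=401, bytes_per=8, narrays=2):
    return threads * n * n * bytes_per * narrays / 2**30

print(round(private_mem_gb(512), 2))   # -> 1.23 GB for 512 threads
print(round(private_mem_gb(768), 2))   # -> 1.84 GB for 768 threads
```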

I’m guessing you only have 2GB of memory?

You can only fix this by reducing the number of threads or decreasing the size of the private arrays. However, your code only uses very few threads as it is, so reducing the number of threads will only hurt your performance even more.

It looks to me like the size of ffpp and ffp is just some fixed max value, and only an n2m by n2m section is actually used. Can you update your code to dynamically allocate these arrays so they use just the necessary amount of memory?

Here’s another section of code with issues.

c$acc region
!c$acc do private(ffkk,ffk)
      do 20 ikx=0,n2m
        do 20 iky=0,ikx
c---------------------------------
      if(wk(ikx,iky).gt.0.d0)  then

... cut

c--------------------------
      do 75 i=1,13
        ffkk(i)=ffkk(i)+ffk(i)*aa
75    continue
c--------------------------
20    continue
c$acc end region

      do 80 i=1,13
        ffkk(i)=2.d0*ffkk(i)*4.d0/nm2
80    continue

Now this code won't accelerate either, due to the loop dependency on ffkk and ffk. It looks like you thought about privatizing them but commented it out. Here, you don't want to privatize ffkk, since its values are used outside of the compute region and the contents of private arrays are destroyed at the end of the compute region.

What you'd really like here is for the compiler to recognize the array reductions. Unfortunately, the OpenACC standard and the PGI Accelerator Model only support scalar reductions. So one fix might be to change ffkk to be 13 scalars instead of an array.


I’m also thinking about your performance. The directives are designed to work on tightly nested loops. So when you have the loop structure of:

      do 19 ikx=0,n2m
        do 19 iky=0,n2m

          do 30 iqx=0,n2m
             do 30 iqy=0,iqx

                 do 40 ipx=0,n2m
                   do 40 ipy=0,ipx

      40 continue
      30 continue
               
           do 31 iqx=0,n2m
             do 31 iqy=0,iqx

       31      continue

     ... scalar code

    19 continue

Only the outer two loops can be accelerated. Since n2m is small, you won't get much parallelism and therefore will see poor performance. However, if you can rearrange your code to something like:

      do 19 ikx=0,n2m
        do 19 iky=0,n2m
          do 30 iqx=0,n2m
             do 30 iqy=0,iqx
                 do 40 ipx=0,n2m
                   do 40 ipy=0,ipx
      40 continue
      30 continue
      19 continue               

      do 29 ikx=0,n2m
        do 29 iky=0,n2m
           do 31 iqx=0,n2m
             do 31 iqy=0,iqx
       31 continue
       29 continue

      do 39 ikx=0,n2m
        do 39 iky=0,n2m

     ... scalar code

      39 continue

Of course, this means adding extra dimensions to your arrays so that dependent information can be passed between the sets of loops. But if it works, you'll have three arrays, a 6-D, a 4-D, and a 2-D, with a lot more exposed parallelism. Granted, I haven't modified your code to see if it will actually work, but it might be worth trying.

I’ll take a look at your CUDA Fortran code next.

  • Mat

Hi herohzy,

For your first issue, you said that when you access several of your argument arrays from a kernel, the program errors out, for example on "wkq=wk(iqx,iqy)". One thing that would help is to compile the code with "-Mcuda=emu -g" and run it through the PGI debugger, pgdbg. What is the value of iqx? For the first thread it's zero, and your thread is accessing memory beyond the bounds of the array, causing the kernel to abort.

For the second error you said you had a problem if you uncommented an if statement. I’m assuming you mean the following:

        if(vl1k.GT.0.0)then
        gaprk1=-vl2k/sqrt(vl1k) !2*gaprk
        else
        gaprk1=0.0
        endif
        gapr=gapr+gaprk1*cc

Now gapr is a scalar dummy argument. I'm assuming that you mean for this to perform a sum reduction across all threads? Unfortunately, this won't work. Instead, you need to add in a parallel sum reduction. I show an example of this in my article on multi-GPU programming: http://www.pgroup.com/lit/articles/insider/v3n3a2.htm.
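For illustration only, here is the shape of such a tree-style sum reduction, sketched in Python rather than CUDA Fortran (a real kernel does the same pairing across threads in shared memory, with synchronization between steps):

```python
# Tree-style sum reduction: halve the active range each step,
# pairing element i with element i + stride, the way a GPU kernel
# pairs threads in shared memory.
def tree_reduce_sum(vals):
    vals = list(vals)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]   # each "thread" adds its partner
        stride *= 2                       # a kernel would synchronize here
    return vals[0]

print(tree_reduce_sum(range(1, 9)))   # -> 36, same as sum(1..8)
```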

  • Mat

Hi, Mat,
Yes, for my CUDA program, when I compiled the code with "-Mcuda=emu -g", it ran normally, including accessing the arguments wk(:,:), bw(:,:), snb(:,:), etc., and even gave a good-looking result (though not yet correct). But how can I get it to run normally on the GPU? Do I need to upgrade my GPU? (I'm now using a GTS 450.)
Thanks a million.
Thanks a million.

Hi herohzy,

When running on the CPU (i.e., in emulation), "off-by-one" and "out-of-bounds" errors generally won't cause your program to fail. However, on the GPU they will. So if it works on the CPU and fails on the GPU, this is typically the cause.

Do I need to upgrade my GPU?

That's one possibility, but I would first try to reorganize the algorithm so it doesn't need as much memory.

  • Mat