PGI Accelerator programming concepts questions

Hello,
The tutorials provided are very handy, but I have a few questions to confirm that I understand this programming model properly:

  1. When I explicitly create a data region with copyin/copyout/local clauses, are these clauses handled only at the beginning and at the end of the data region, no matter how many compute (acc) regions I create within it, rather than at each compute region?
  2. Can I nest data regions within themselves?
  3. Do IF/SWITCH clauses have to be avoided only within compute ACC regions? Can I have them normally in data regions?
  4. Can I put a subroutine call within a data region? (That subroutine then has compute regions within it.)
  5. Could you please provide a short code example demonstrating the update clause?

And some questions concerning the PGI Accelerator environment:

  1. I’m deploying the accelerated software in Fortran90 on a Tesla rack server (4xGT200) and enabling multi-GPU via MPI.
    a) ACC_NOTIFY shows me kernel launch info only from the process with rank 0, even though 4 separate GPUs are utilised (1 MPI rank per core, with 1 GPU each). Can I see this information from all 4 processes?
    b) when can I expect support for PGI accelerator within OpenMP regions?
  2. How should I use the pgi_accinit tool? Is running it in the background enough?
  3. When I compile software with static common blocks greater than 2 GB, I get compiler errors (even without acceleration). Introducing -mcmodel=medium fixes the errors within my own software, but the errors still occur in some HPF libraries from the PGI compilers directory… (Linux x86_64, Fedora 11)

Thank You in advance for your replies.

  1. When I explicitly create a data region with copyin/copyout/local clauses, are these clauses handled only at the beginning and at the end of the data region, no matter how many compute (acc) regions I create within it, rather than at each compute region?

Correct. The exception is when you use the update directive.

  2. Can I nest data regions within themselves?

Yes.
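For instance, nesting might look like this (a rough sketch only; the array names and loop bounds are illustrative):

!$acc data region copyin(a)      ! outer region: a resides on the device throughout
!$acc data region local(tmp)     ! inner region: tmp exists only on the device
!$acc region
   do i = 1, n
      tmp(i) = a(i) * 2.0
   end do
!$acc end region
!$acc end data region
!$acc end data region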

  3. Do IF/SWITCH clauses have to be avoided only within compute ACC regions? Can I have them normally in data regions?

No, the if clause only applies to compute regions.
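For reference, a compute region's if clause makes device execution conditional; a minimal sketch (the threshold and names here are illustrative):

! run on the GPU only when the loop is large enough to amortise the transfers
!$acc region if(n > 4096)
   do i = 1, n
      a(i) = a(i) + b(i)
   end do
!$acc end region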

  4. Can I put a subroutine call within a data region?

In host code, yes. In an acc compute region, no.

(That subroutine then has compute regions within it.)

Soon. The PGI 2011 (aka 11.0) release in November will allow data regions to span subroutine calls using the ‘reflected’ directive. Note that this will be a Fortran-only feature.

  5. Could you please provide a short code example demonstrating the update clause?


% cat update.f90


program foo

   real, dimension(1024) :: A, B
   integer i
   A = 1.0

!$acc data region copy(A), local(B)
!$acc region
   do i=1,1024
     B(i) = A(i) / 2
     A(i) = A(i) * i
   end do
!$acc end region

! update the host copy of A and print the intermediary values
!$acc update host(A)
   print *, A(1), A(1024)

!$acc region
   do i=1,1024
     A(i) = B(i) * 2
   end do
!$acc end region

!$acc end data region

   print *, A(1), A(1024)

   end program foo

% pgf90 update.f90 -V10.9 -ta=nvidia -Minfo=accel
foo:
      9, Generating local(b(:))
         Generating copy(a(:))
     10, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(256)
             Using register for 'a'
              CC 1.0 : 6 registers; 20 shared, 24 constant, 0 local memory bytes; 100% occupancy
              CC 1.3 : 6 registers; 20 shared, 24 constant, 0 local memory bytes; 100% occupancy
     17, Generating !$acc update host(a(:))
     20, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     21, Loop is parallelizable
         Accelerator kernel generated
         21, !$acc do parallel, vector(256)
              CC 1.0 : 3 registers; 20 shared, 24 constant, 0 local memory bytes; 100% occupancy
              CC 1.3 : 3 registers; 20 shared, 24 constant, 0 local memory bytes; 100% occupancy
% a.out
    1.000000        1024.000
    1.000000        1.000000

Hope this helps,
Mat

Hello,
thank You very much for such a fast and comprehensive reply.

Quote:
3) IF/SWITCH clauses have to be avoided only within computing ACC regions, I can have them normally in data regions?

No, the if clause only applies to compute regions.

My mistake, I meant classical IF/SWITCH fortran statements, not the PGI Acc clauses.

Could You also answer to the environmental questions?
I also found a thread about Accelerator regions within OpenMP regions; is this feature already available, or when can we expect it?
What are the new features we can expect in the upcoming releases?
Besides OpenMP, is there a possibility for the Accelerator Programming Model to act like OpenCL and empower heterogeneous architectures? I am aware of the Unified Binary Technology, but it won’t automatically deploy on a computing cluster node to balance all the computations across CPUs+GPUs where the number of CPU cores >> number of GPUs.

My mistake, I meant classical IF/SWITCH fortran statements, not the PGI Acc clauses.

No problem. You can have any Fortran statements in the host code that is within a data region.
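For example, an ordinary host IF can choose between compute regions inside a single data region (a sketch; use_scaling and s are illustrative names):

!$acc data region copy(a)
   if (use_scaling) then        ! plain host Fortran IF statement
!$acc region
      do i = 1, n
         a(i) = a(i) * s
      end do
!$acc end region
   else
!$acc region
      do i = 1, n
         a(i) = a(i) + s
      end do
!$acc end region
   end if
!$acc end data region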

I also found a thread about Accelerator regions with OpenMP regions, is this feature also already available or when can we expect it?

You can use accelerator regions within an OpenMP parallel region. The only caveat is that you can’t use both on the same loop. The basic outline is:

!$omp parallel private(threadid)

   threadid = omp_get_thread_num()

   ! set your device
   call acc_set_device_num(threadid, acc_device_nvidia)

   ! start your accelerator region, or call a routine that contains the acc directives
!$acc region
   ... etc.
!$acc end region

!$omp end parallel

On a side note, you can also use acc directives within MPI code or even hybrid MPI/OpenMP code.

What are the new features we can expect in the upcoming releases?

Support for the reflected and mirror clauses will be available in the 11.0 release.

Besides OpenMP, is there a possibility for the Accelerator Programming Model to act like OpenCL and empower heterogeneous architectures? I am aware of the Unified Binary Technology, but it won’t automatically deploy on a computing cluster node to balance all the computations across CPUs+GPUs where the number of CPU cores >> number of GPUs.

I assume you mean that you would like the accelerator work to be split across both the GPU and CPU, not just the either/or support found with Unified Binary?

The short answer is no.

I don’t know OpenCL myself, but I don’t see how this could be done effectively in an automatic and general way. The lack of a unified memory is a major problem, and load balancing would be algorithm dependent. You could certainly do this yourself, for example having one OpenMP thread run on the GPU and another on the CPU, but the compiler simply doesn’t have enough information to make good choices automatically.

Hope this helps,
Mat

Thanks, now I understand the full idea of OpenMP+MPI+PGI Acc, though I have one more tiny question concerning what’s below:

Would code like this work:

!$acc data region copyin(a,b,c,d) local(e,f)

...

!$acc region
do
...
end do
!$acc end region
...

call subfunct1(a,c,e)

...

!$acc region
do
...
end do
!$acc end region
...

call subfunct2(a,b,d,e,f)

!$acc end data region

Subfunct1 and subfunct2 would also have !$acc regions, but without an explicit data region, and I want to avoid extra H2D and D2H data transfers.
Would the compiler automatically remap these variables within these subroutines and inline them, or do I have to pull the raw code out of these subroutines and put it here in place of the calls?

Thanks again in advance for your support,
Nicolas

Hi Nicolas,

Would code like this work:

Yes, so long as you use the ‘reflected’ directive within subfunct1 and subfunct2 to tell the compiler to check whether the arrays have already been allocated on the device.

Note that if you don’t want to wait for the 11.0 release, on Linux the PGI ACC model can be mixed with CUDA Fortran. So instead of using a data region, you can use CUDA Fortran device arrays. It means that your program is no longer portable and that you’d want to back these changes out once 11.0 is available, but it is less work than inlining your subroutines and allows you to continue development.
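A rough illustration of that interim approach, assuming CUDA Fortran is enabled (e.g. via -Mcuda) and using illustrative names:

module dev_data
   ! CUDA Fortran device array: resides on the GPU for the life of the allocation
   real, device, allocatable :: a_d(:)
end module

subroutine subfunct1(n)
   use dev_data
   integer :: n, i
!$acc region
   do i = 1, n
      a_d(i) = a_d(i) * 2.0   ! data is already resident; no extra H2D/D2H copies
   end do
!$acc end region
end subroutine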

  • Mat

Ok, so for now it’s either CUDA Fortran or manual inlining :) Thank You for the support.

Hello there,
where should the !$acc reflected statement be located in the subroutines?
Just under the routine definition, or at the beginning of the body after all the variable/common block declarations?

The reflected directive should be added after the variable is declared.

Note the reflected will be available in the 11.0 release due this December.

  • Mat

Ok, could You provide an example please? :)
Not until December? It is already documented in the Programming Model Guide :)
Any ETA on when exactly in December You will release the 11th compiler suite?

Hi A,

Here’s a simple example of using the reflected directive:

% cat refected.f90 

module mm
contains
 subroutine sub1( a, b, c )
  implicit none
  real :: a(:,:), b(:,:), c(:,:)
  !$acc reflected(a)
  integer :: i,j
  !$acc region
   do j = 1,ubound(a,2)
    do i = 1,ubound(a,1)
     a(i,j) = b(i,j) + c(i,j)
    enddo
   enddo
  !$acc end region 
 end subroutine
end module

program p
 use mm
 use accel_lib
 implicit none
 integer, parameter :: n=32,m=32
 real :: a(n,m), b(n,m), c(n,m)
 integer :: i,j
 do j = 1,m
  do i = 1,n
   a(i,j) = -1.0
   b(i,j) = (j*100) + i
   c(i,j) = -(j*100) + i
  enddo
 enddo

 !$acc data region copyout(a)
  call sub1(a,b,c)
 !$acc end data region

  print *, a(1,1), a(n,m) 
  print *, b(1,1), b(n,m) 
  print *, c(1,1), c(n,m) 
end program
% pgf90 -ta=nvidia -Minfo=accel refected.f90 -V11.0 ; a.out
sub1:
      7, Generating local(a(:,:))
      9, Generating copyin(b(1:z_b_0,1:z_b_3))
         Generating copyin(c(1:z_b_0,1:z_b_3))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     10, Loop is parallelizable
     11, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         11, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.0 : 7 registers; 64 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 8 registers; 64 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 15 registers; 8 shared, 72 constant, 0 local memory bytes; 100% occupancy
p:
     34, Generating copyout(a(:,:))
    2.000000        64.00000    
    101.0000        3232.000    
   -99.00000       -3168.000



Yet in December? It is already documented in the Programming Model Guide :)

The Model is ahead of the implementation, though in 11.0 we will have the PGI 1.2 Accelerator Model fully implemented. There is still more work to do, since the spec for the 1.3 Model just came out as well.

Any ETA on when exactly in December You will release the 11th compiler suite?

Right now we’re finishing up a few last minute fixes. Barring any show stopping errors, we’re expecting 11.0 to be available mid month.

  • Mat