Sharing device data with subroutines and Fortran !$acc directives

Hi (again),

Is there any way using !$acc directives to share data already located on the GPU with Fortran subroutines?

When I compile the test code below with v10.6.0 of the pgf90 compiler, the (edited) output reads:

test_data_spans_subs:
     32, Generating copyin(u(:))
sub1:
     47, Generating copyin(u(1:100))
sub2:
     63, Generating copyin(u(1:100))

Line 32 is the data region in the main program. I would like to eliminate the repeated data copying in the subroutines (lines 47 and 63).

I don’t mind promoting u into MODULE my_data (and not passing it) but I couldn’t find how to make that work either.
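For what it’s worth, the module-promotion version I tried looked roughly like this (a sketch from memory, not the exact code):

MODULE my_data
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 100
   INTEGER :: u(N)          ! u now lives in the module instead of being passed
END MODULE my_data
!!$**************************************
SUBROUTINE sub1()
   USE my_data              ! picks up u (and N) from the module
   IMPLICIT NONE
   INTEGER :: r
!$acc region
   r = SUM(u)
!$acc end region
   PRINT *,"Total 1: ",r
END SUBROUTINE sub1

The caller still wraps the two calls in !$acc data region copyin(u), but I could not get the subroutines to reuse that device copy.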

Thanks for any advice you can offer,

Alistair.

!!$**************************************        
MODULE my_data
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 100
END MODULE my_data
!!$**************************************                                                  
PROGRAM test_data_spans_subs

   USE my_data
   IMPLICIT NONE

   INTEGER :: u(N)
   INTEGER :: i,r

   DO i = 1,N
    u(i) = i
   ENDDO

!!$ Attempt 1: inlined                                                                     
!$acc data region copyin(u)                                                                
!$acc region                                                                               
   r = SUM(u)
!$acc end region                                                                           
   PRINT *,"Total 1: ",r
!$acc region                                                                               
   r = SUM(u**2)
!$acc end region                                                                           
   PRINT *,"Total 2: ",r
!$acc end data region                                                                      

!!$ Attempt 2: subroutines                                                                 
!$acc data region copyin(u)                                                                
   CALL sub1(u)
   CALL sub2(u)
!$acc end data region                                                                      

END PROGRAM test_data_spans_subs
!!$**************************************                                                  
SUBROUTINE sub1(u)

   USE my_data
   IMPLICIT NONE
   INTEGER, INTENT(in) :: u(N)

   INTEGER :: r

!$acc region                                                                               
   r = SUM(u)
!$acc end region                                                                           

   PRINT *,"Total 1: ",r

END SUBROUTINE sub1
!!$**************************************                                                  
SUBROUTINE sub2(u)

   USE my_data
   IMPLICIT NONE
   INTEGER, INTENT(in) :: u(N)

   INTEGER :: r

!$acc region                                                                               
   r = SUM(u**2)
!$acc end region                                                                           

   PRINT *,"Total 2: ",r

END SUBROUTINE sub2
!!$**************************************

Hi Alistair,

What you need is the ‘reflected’ clause. (See section 2.6.6 of http://www.pgroup.com/lit/whitepapers/pgi_accel_prog_model_1.2.pdf.) Unfortunately, implementing this feature turned out to be more complex than originally thought, and it has been delayed until the 10.8/10.9 time frame.
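Once it is implemented, the usage should look something like the sketch below (based on the spec document; the exact syntax may change before it ships):

SUBROUTINE sub1(u)
   USE my_data
   IMPLICIT NONE
   INTEGER, INTENT(in) :: u(N)
!$acc reflected(u)       ! reuse the device copy created by the caller's data region
   INTEGER :: r
!$acc region
   r = SUM(u)
!$acc end region
   PRINT *,"Total 1: ",r
END SUBROUTINE sub1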

In the meantime, you can use CUDA Fortran device arrays to pass the device data to routines. The caveat is that mixing CUDA Fortran and the PGI Accelerator model is only supported on Linux and should be considered experimental.

Example:

% cat test.f90

#ifdef _ACCEL
#define CUDA_DEVICE ,device
#else
#define CUDA_DEVICE
#endif

!!$**************************************
MODULE my_data
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 100
END MODULE my_data
!!$**************************************
PROGRAM test_data_spans_subs

   USE my_data
   IMPLICIT NONE

   INTEGER CUDA_DEVICE :: u(N)
   INTEGER :: i,r

!$acc region
   DO i = 1,N
    u(i) = i
   ENDDO
!$acc end region

!!$ Attempt 1: inlined
!$acc region
   r = SUM(u)
!$acc end region
   PRINT *,"Total 1: ",r
!$acc region
   r = SUM(u**2)
!$acc end region
   PRINT *,"Total 2: ",r

!!$ Attempt 2: subroutines
   CALL sub1(u)
   CALL sub2(u)

END PROGRAM test_data_spans_subs
!!$**************************************
SUBROUTINE sub1(u)

   USE my_data
   IMPLICIT NONE
   INTEGER CUDA_DEVICE, INTENT(in) :: u(N)

   INTEGER :: r

!$acc region
   r = SUM(u)
!$acc end region

   PRINT *,"Total 1: ",r

END SUBROUTINE sub1
!!$**************************************
SUBROUTINE sub2(u)

   USE my_data
   IMPLICIT NONE
   INTEGER CUDA_DEVICE, INTENT(in) :: u(N)

   INTEGER :: r

!$acc region
   r = SUM(u**2)
!$acc end region

   PRINT *,"Total 2: ",r

END SUBROUTINE sub2
!!$**************************************

% pgf90 -ta=nvidia -Mcuda test.f90 -Minfo=accel -Mpreprocess -o test_gpu.out
test_data_spans_subs:
     22, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     23, Loop is parallelizable
         Accelerator kernel generated
         23, !$acc do parallel, vector(96)
             CC 1.0 : 4 registers; 20 shared, 20 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 4 registers; 20 shared, 20 constant, 0 local memory bytes; 75% occupancy
     29, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     30, Loop is parallelizable
         Accelerator kernel generated
         30, !$acc do parallel, vector(96)
             Sum reduction generated for u$r
             CC 1.0 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 75% occupancy
     33, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     34, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc do parallel, vector(96)
             Sum reduction generated for u$r1
             CC 1.0 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 75% occupancy
sub1:
     52, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     53, Loop is parallelizable
         Accelerator kernel generated
         53, !$acc do parallel, vector(96)
             Sum reduction generated for u$r
             CC 1.0 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 75% occupancy
sub2:
     68, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     69, Loop is parallelizable
         Accelerator kernel generated
         69, !$acc do parallel, vector(96)
             Sum reduction generated for u$r
             CC 1.0 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 75% occupancy
% pgf90 test.f90 -Mpreprocess -o test_cpu.out
% test_gpu.out
 Total 1:          5050
 Total 2:        338350
 Total 1:          5050
 Total 2:        338350
% test_cpu.out
 Total 1:          5050
 Total 2:        338350
 Total 1:          5050
 Total 2:        338350

Hope this helps,
Mat

Hi Mat,

Thanks for the reply: as usual, quick and informative.

Yes, this code works. As you say, this mixing of CUDA Fortran and directives is experimental. Even so, I’ll report the following bug.

The code below sums an array slice.

-DCase1: Explicit copyin/out of u for each region. Works.
-DCase2: Data region with local u spanning all regions. Works.
-DCase3: DEVICE array u. Does not work.

If I set ns=1, the code appears to work for all three cases.

I get the same result for r1 and r2, since the compiler recognises the explicit loop for r1 as a reduction.

The compiler was v10.6.0 and the other compilation options were:

pgf90 -ta=nvidia:cc20,time -Mcuda -Minfo=accel

I’m not doing anything wrong, am I?

Cheers,


Alistair

#if defined(Case3) && defined(_ACCEL)
#define CUDA_DEVICE ,device
#else
#define CUDA_DEVICE
#endif

PROGRAM test_sum_reduction

   IMPLICIT NONE

   INTEGER, PARAMETER :: N = 100

   INTEGER CUDA_DEVICE :: u(N) 
   INTEGER :: i,r1,r2

   INTEGER :: ns,no,nf

   ns = 25
   no = 50
   nf = ns + no - 1

#ifdef Case2
!$acc data region local(u)                                                      
#endif

!$acc region                                                                    
   DO i = 1,N
    u(i) = i
   ENDDO
!$acc end region                                                                

!$acc region                                                                    
   r1 = 0
   DO i = ns,nf
    r1 = r1 + u(i)
   ENDDO
!$acc end region                                                                

!$acc region                                                                    
   r2 = SUM(u(ns:nf))
!$acc end region                                                                

#ifdef Case2
!$acc end data region                                                           
#endif

   PRINT *,"Explicit sum:   ",r1
   PRINT *,"SUM intrinsic:  ",r2
   PRINT *,"Correct answer: ",nf*(nf+1)/2 - (ns-1)*(ns-1+1)/2

END PROGRAM test_sum_reduction

Update to message above:

Removing the “!$acc region” statements from around the reductions for r1 and r2 seemed to make it work for Case3.
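For Case3 that just means the two reductions run as ordinary host code:

!!$ Case3 workaround: no !$acc region around the reductions
   r1 = 0
   DO i = ns,nf
    r1 = r1 + u(i)
   ENDDO
   r2 = SUM(u(ns:nf))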

Is it tautology to make an array “DEVICE” and then address it using a “!$acc region”?

Hi Alistair,

It appears to me that the compiler is starting the summation of the CUDA Fortran device array "u" at index 1 instead of at "ns". This is because the device pointer is passed to the reduction rather than the array section. In the accelerator-managed (non-device) versions, the correct section of the array, starting at ns, is passed to the reduction.

I’ll note it, but I don’t think it is something we’ll fix anytime soon. Hence, please consider this a limitation: when using CUDA Fortran device arrays in reductions, you can only use sections that begin at index 1.
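If you do need the sum of an interior slice on the device, one untested workaround that stays within that limitation is to build it from two sums that both start at index 1:

!$acc region
   r2 = SUM(u(1:nf)) - SUM(u(1:ns-1))   ! both sections begin at index 1
!$acc end region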

Is it tautology to make an array “DEVICE” and then address it using a “!$acc region”?

No. This works because there is an implicit copy from the device into a temporary host array and the summation is done on the host.

It’s the same as:

   INTEGER :: u_host(N)      ! temporary host array
   u_host = u                ! implicit device-to-host copy
   r2 = SUM(u_host(ns:nf))   ! reduction runs on the host

Hope this helps,
Mat

Thanks for sorting this out. I’ll note the limitation.

Thanks also for explaining the u_host behaviour. I’ll put the !$acc statements back in, then.

How do I get the compiler to tell me when, for instance, it is copying data back to the host? -Minfo=accel does not seem to give any information on this, and there doesn’t seem to be a suitable option for either -Minfo or -Mcuda.

Thanks,

Alistair.