This is my case: t_ptr is a pointer to a pointer to a 3-D array (t_ptr => ptr => u1), and sclr is a scalar value. Here is my code snippet, which gives a "Segmentation Fault".
#ifdef __PGI
!$acc data region copy(t_ptr) copyin(sx, sy, sz, sclr)
!$acc region do parallel
#endif
      do i = 1, 1
         t_ptr(sx, sy, sz) = t_ptr(sx, sy, sz) + sclr
      enddo
#ifdef __PGI
!$acc end region
!$acc end data region
#endif
Here are the compiler's informational messages:
13, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
32, Loop unrolled 3 times (completely unrolled)
36, Generating copyin(sclr)
Generating copyin(sz)
Generating copyin(sy)
Generating copyin(sx)
Generating copy(ptr(:,:,:))
38, Generating copy(t_ptr(sx,sy,sz))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
40, Complex loop carried dependence of 't_ptr' prevents parallelization
Loop carried dependence due to exposed use of 't_ptr(sx,sy,sz)' prevents parallelization
Accelerator kernel generated
40, !$acc do seq
CC 1.0 : 2 registers; 44 shared, 0 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 4 registers; 0 shared, 60 constant, 0 local memory bytes; 16% occupancy
First, F90 pointers are not yet supported within Accelerator regions. We need to resolve aliasing issues before this can be added, but our hope is that it's a solvable problem.
Second, you don't want to copy your scalars in the data region. This promotes them to global memory, which hurts performance. Instead, remove the copyin clause and let the compiler either privatize them or pass them in as arguments to the generated kernel.
Third, the loop is not parallel since every iteration of the loop updates the same element of t_ptr. Granted, the trip count is 1 so there’s no parallelism to begin with, but the compiler’s dependency analysis doesn’t take the trip count into account when determining independence.
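Putting those three points together, here is a hedged sketch of how the snippet might be restructured: operate on the pointer's target array directly (assumed here to be u1), drop the scalar copyin clauses, and give the loop distinct elements to update so it can actually be parallelized. The array sizes, bounds, and initialization below are made up for illustration.

program update_plane
   implicit none
   integer, parameter :: nx = 64, ny = 64, nz = 64   ! illustrative sizes
   real :: u1(nx, ny, nz)
   real :: sclr
   integer :: i, j, sz

   u1   = 0.0
   sclr = 1.5
   sz   = 3

!$acc data region copy(u1)
!$acc region
   ! Each iteration updates a distinct element of the k-plane, so the
   ! loops can be parallelized; sz and sclr are passed to the kernel
   ! directly rather than being copied in through the data region.
   do j = 1, ny
      do i = 1, nx
         u1(i, j, sz) = u1(i, j, sz) + sclr
      enddo
   enddo
!$acc end region
!$acc end data region

   print *, 'u1(1,1,sz) =', u1(1, 1, sz)
end program update_plane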
Thank you for your reply. I guess the only way left would be to access the arrays directly, and let the compiler recognize the reduction.
Please bear with me, I am new to Fortran, and after your reply I have a question regarding array processing. If there is an operation like:
A = A + B
or
A = A + (B*C)
where A, B, and C are arrays of the same shape. If such an operation occurs within a compute region, will it be moved to the device?
Is there a way for the compiler to recognize the array reduction inside the acc region and make it parallel?
Yes, the compiler will automatically detect reductions and generate optimal parallel reduction code. However, the reduction variable must be a scalar, and the reduction must be performed across the outer loop.
For example:
sum = 0
!$acc region
do k = k0, k1
   do j = j0, j1
      do i = i0, i1
         sum = sum + u(i,j,k)
      enddo
   enddo
enddo
!$acc end region
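For completeness, a minimal self-contained version of that snippet (the array size, bounds, and initialization are made up) might look like:

program reduce_sum
   implicit none
   integer, parameter :: n = 32                      ! illustrative size
   integer, parameter :: i0 = 1, i1 = n, j0 = 1, j1 = n, k0 = 1, k1 = n
   real :: u(n, n, n), sum
   integer :: i, j, k

   u = 1.0                                           ! every element adds 1.0

   sum = 0
!$acc region
   do k = k0, k1
      do j = j0, j1
         do i = i0, i1
            sum = sum + u(i,j,k)
         enddo
      enddo
   enddo
!$acc end region

   print *, 'sum =', sum                             ! expect real(n**3) = 32768.0
end program reduce_sum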
As part of our OpenACC implementation, we're working on adding support for inner loop reductions. For example:
!$acc parallel
!$acc loop gang
do k = k0, k1
   sum = 0
!$acc loop vector(32), reduction(+:sum)
   do j = j0, j1
      sum = sum + u(j,k)
   enddo
   array(k) = sum
enddo
!$acc end parallel
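Again as a hedged sketch, and assuming the inner-loop reduction support described above is available in the compiler, a self-contained program built around that pattern (the sizes and the row-sum use case are made up) could be:

program row_sums
   implicit none
   integer, parameter :: nj = 1024, nk = 256         ! illustrative sizes
   integer, parameter :: j0 = 1, j1 = nj, k0 = 1, k1 = nk
   real :: u(nj, nk), array(nk), sum
   integer :: j, k

   u = 2.0

!$acc parallel
!$acc loop gang
   do k = k0, k1
      sum = 0
!$acc loop vector(32), reduction(+:sum)
      do j = j0, j1
         sum = sum + u(j,k)
      enddo
      array(k) = sum                                 ! one row sum per k
   enddo
!$acc end parallel

   print *, 'array(1) =', array(1)                   ! expect nj*2.0 = 2048.0
end program row_sums

Here each gang handles one k iteration, the 32-wide vector loop reduces its row into the gang-local sum, and the per-row result is written back to array(k).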