Hi everyone.

I have a Fortran CPU code that applies a weighted ‘triangle filter’ to a 3D array, which has a padded layer of preset boundary values on every face. It works by shifting a pointer to the interior subarray (2:n-1 in each dimension) inside three nested do loops, with a weighting factor for each of the different ‘directions’. This amounts to 27 whole-array operations on arrays that are often large (500^3). I tried to port this using OpenACC directives but couldn’t get the pointer shifting to work on the device, so I went for the element-by-element implementation below, which gives the same results as the CPU filter:

```
subroutine filtergpu(r, a, n)
  implicit none
  integer, intent(in) :: n
  real, dimension(:,:,:), intent(in)    :: a   ! input, including the padded boundary layer
  real, dimension(:,:,:), intent(inout) :: r   ! filtered result (interior points only)
  integer :: i, j, k
  !
  ! -- loop through the full 3x3x3 (27-point) stencil
  !$acc data region copyin(a(1:n,1:n,1:n)) copyout(r(1:n,1:n,1:n))
  !$acc region
  do i = 2, n-1
    do j = 2, n-1
      do k = 2, n-1
        ! centre point (1/8), 6 face neighbours (1/16 each),
        ! 12 edge neighbours (1/32 each), 8 corner neighbours (1/64 each)
        r(i,j,k) = 0.125*(a(i,j,k)) &
          + 0.0625*(a(i-1,j,k) + &
                    a(i,j-1,k) + &
                    a(i,j,k-1) + &
                    a(i+1,j,k) + &
                    a(i,j+1,k) + &
                    a(i,j,k+1)) &
          + 0.03125*(a(i,j+1,k+1) + &
                     a(i,j+1,k-1) + &
                     a(i,j-1,k+1) + &
                     a(i,j-1,k-1) + &
                     a(i+1,j,k+1) + &
                     a(i+1,j,k-1) + &
                     a(i-1,j,k+1) + &
                     a(i-1,j,k-1) + &
                     a(i+1,j+1,k) + &
                     a(i+1,j-1,k) + &
                     a(i-1,j+1,k) + &
                     a(i-1,j-1,k)) &
          + 0.015625*(a(i+1,j+1,k+1) + &
                      a(i+1,j+1,k-1) + &
                      a(i+1,j-1,k+1) + &
                      a(i+1,j-1,k-1) + &
                      a(i-1,j+1,k+1) + &
                      a(i-1,j+1,k-1) + &
                      a(i-1,j-1,k+1) + &
                      a(i-1,j-1,k-1))
      end do
    end do
  end do
  !$acc end region
  !$acc end data region
end subroutine filtergpu
```
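As a sanity check on the four weights, here is how they would follow if the filter is the separable product of a 1D triangle kernel (1/4, 1/2, 1/4) in each dimension. That separability is an assumption on my part, inferred from the first three weights, and the symbols h and w are just notation introduced for this check:

```latex
% Assumed 1D triangle kernel: h(0) = 1/2, h(\pm 1) = 1/4
w(\Delta i, \Delta j, \Delta k) = h(\Delta i)\, h(\Delta j)\, h(\Delta k)

\begin{aligned}
\text{centre (1 point):}   \quad & \tfrac12 \cdot \tfrac12 \cdot \tfrac12 = \tfrac{1}{8}  = 0.125 \\
\text{faces (6 points):}   \quad & \tfrac12 \cdot \tfrac12 \cdot \tfrac14 = \tfrac{1}{16} = 0.0625 \\
\text{edges (12 points):}  \quad & \tfrac12 \cdot \tfrac14 \cdot \tfrac14 = \tfrac{1}{32} = 0.03125 \\
\text{corners (8 points):} \quad & \tfrac14 \cdot \tfrac14 \cdot \tfrac14 = \tfrac{1}{64} = 0.015625
\end{aligned}

% The 27 weights sum to one, so the filter preserves a constant field:
\tfrac{1}{8} + \tfrac{6}{16} + \tfrac{12}{32} + \tfrac{8}{64} = 1
```

Under that assumption the corner weight should be 1/64 = 0.015625, and a uniform input array should come out of the filter unchanged.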

Now, this is faster than the CPU code running on a single core, but as the array size ‘n’ is increased the run time grows at much the same rate as the CPU code does, roughly in proportion to the number of elements (n^3):

```
  n    CPU Time (s)   GPU Time (s)
 32    2.74E-004      1.62E-004
 64    2.11E-003      1.00E-003
128    7.65E-003      6.57E-003
256    5.83E-002      4.87E-002
512    0.53           0.39
```
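To put numbers on the scaling, here is the arithmetic for the last doubling of n, using only the timings in the table above:

```latex
% Doubling n from 256 to 512 multiplies the element count by 2^3 = 8
\frac{512^3}{256^3} = 8

% Measured growth in run time over the same step
\frac{t_{\mathrm{CPU}}(512)}{t_{\mathrm{CPU}}(256)} = \frac{0.53}{5.83\times10^{-2}} \approx 9.1
\qquad
\frac{t_{\mathrm{GPU}}(512)}{t_{\mathrm{GPU}}(256)} = \frac{0.39}{4.87\times10^{-2}} \approx 8.0

% GPU speedup over a single CPU core at n = 512
\frac{t_{\mathrm{CPU}}(512)}{t_{\mathrm{GPU}}(512)} = \frac{0.53}{0.39} \approx 1.4
```

So both versions take about 8 to 9 times longer for 8 times as many elements, and the GPU is only about 1.4 times faster than one CPU core at the largest size.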

I have a feeling this is due to many threads trying to access the same memory locations. I tried another approach and made each of the 27 calculations an individual accelerator region, which also worked but was slower than the code above. I assume this is because it isn’t efficient to launch 27 small kernels one after the other.

So, does anyone have any tips on how to speed this filter up? I am very new to OpenACC coding, haven’t got any experience in CUDA and only really have average experience with Fortran itself.

Thanks, Harry