How to speed up a 3D triangle filter?

Hi everyone.

I have a Fortran CPU code that applies a weighted ‘triangle filter’ to a 3D array (which has a padded layer of boundary values preset on all faces). It works by shifting a pointer subarray (2:n-1 in each dimension) inside three nested do loops, with a weighting factor for each ‘direction’, so it ends up performing 27 whole-array operations on often large (500^3) arrays. I tried to turn this into an OpenACC directive but couldn’t get the pointer shifting to work on the device, so I went for an element-by-element implementation instead, which gives the same results as the CPU filter.
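
For context, the CPU version does something like the sketch below (a rough sketch, not my actual code; the names are made up, but the weights are the ones I use: they halve with each unit of offset, 1/8 centre, 1/16 faces, 1/32 edges, 1/64 corners, summing to 1):

 subroutine filtercpu(r, a, n)
    real,dimension(:,:,:) :: a,r
    real :: w
    integer :: n,di,dj,dk
    r(2:n-1,2:n-1,2:n-1) = 0.0
! -- one shifted-subarray update per stencil offset: 27 in total
    do dk = -1,1
    do dj = -1,1
    do di = -1,1
       w = 0.125 * 0.5**(abs(di)+abs(dj)+abs(dk))
       r(2:n-1,2:n-1,2:n-1) = r(2:n-1,2:n-1,2:n-1) &
                            + w*a(2+di:n-1+di,2+dj:n-1+dj,2+dk:n-1+dk)
    end do
    end do
    end do
 end subroutine

Here is the element-by-element GPU version: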

 subroutine filtergpu(r, a, n)
    real,dimension(:,:,:) :: a,r
    integer :: i,j,k,n
!
! -- loop through the full 3^N stencil
!$acc data region copyin(a(1:n,1:n,1:n)) copyout(r(1:n,1:n,1:n))
!$acc region
    do i = 2,n-1
    do j = 2,n-1
    do k = 2,n-1
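       ! weights: centre 1/8, faces 1/16, edges 1/32, corners 1/64 (sum = 1)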
       r(i,j,k) =     0.125*(a(i,j,k))&
                 +   0.0625*(a(i-1,j,k)+&
                            a(i,j-1,k)+&
                            a(i,j,k-1)+&
                            a(i+1,j,k)+&
                            a(i,j+1,k)+&
                            a(i,j,k+1))&
                 +  0.03125*(a(i,j+1,k+1)+&
                            a(i,j+1,k-1)+&
                            a(i,j-1,k+1)+&
                            a(i,j-1,k-1)+&
                            a(i+1,j,k+1)+&
                            a(i+1,j,k-1)+&
                            a(i-1,j,k+1)+&
                            a(i-1,j,k-1)+&
                            a(i+1,j+1,k)+&
                            a(i+1,j-1,k)+&
                            a(i-1,j+1,k)+&
                            a(i-1,j-1,k))&
                 + 0.015625*(a(i+1,j+1,k+1)+&
                            a(i+1,j+1,k-1)+&
                            a(i+1,j-1,k+1)+&
                            a(i+1,j-1,k-1)+&
                            a(i-1,j+1,k+1)+&
                            a(i-1,j+1,k-1)+&
                            a(i-1,j-1,k+1)+&
                            a(i-1,j-1,k-1))
    end do
    end do
    end do
!$acc end region
!$acc end data region
 end subroutine

Now, this is faster than the CPU code running on a single core, but its runtime grows with n in the same fashion as the CPU code’s (roughly linearly in the number of elements, n^3):

n     CPU Time (s)   GPU Time (s)
32    2.74E-04       1.62E-04
64    2.11E-03       1.00E-03
128   7.65E-03       6.57E-03
256   5.83E-02       4.87E-02
512   0.53           0.39
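
(For anyone wanting to reproduce the comparison: a minimal timing harness could look like the sketch below. It is not my exact driver; note the warm-up call, since the first accelerator call pays a one-off device initialisation cost.)

 program timefilter
    implicit none
    interface
       subroutine filtergpu(r, a, n)
          real,dimension(:,:,:) :: a,r
          integer :: n
       end subroutine
    end interface
    integer, parameter :: n = 512
    integer :: c0, c1, crate
    real, allocatable :: a(:,:,:), r(:,:,:)
    allocate(a(n,n,n), r(n,n,n))
    call random_number(a)      ! fills interior and the padded boundary layer
    call filtergpu(r, a, n)    ! warm-up call: absorbs device initialisation
    call system_clock(c0, crate)
    call filtergpu(r, a, n)
    call system_clock(c1)
    print *, 'n =', n, '  GPU time (s) =', real(c1-c0)/real(crate)
 end program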

I have a feeling this is due to many threads trying to read the same memory locations. I tried another approach and made each of the 27 contributions an individual accelerator region, which also worked but was slower than the code above; I assume this is because it isn’t efficient to launch 27 small kernels one after the other.
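
For reference, that attempt was along these lines (a sketch, with only the first two of the 27 regions shown):

!$acc region
    r(2:n-1,2:n-1,2:n-1) = 0.125*a(2:n-1,2:n-1,2:n-1)
!$acc end region
!$acc region
    r(2:n-1,2:n-1,2:n-1) = r(2:n-1,2:n-1,2:n-1) &
                         + 0.0625*a(1:n-2,2:n-1,2:n-1)
!$acc end region
! ...and so on for the remaining 25 shifts. Each region is a separate
! kernel launch, and without an enclosing data region each one also
! re-copies the arrays between host and device.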

So, does anyone have any tips on how to speed this filter up? I am very new to OpenACC coding, have no experience with CUDA, and only average experience with Fortran itself.

Thanks, Harry

Hi Harry,

Minor note: you may want to switch to using OpenACC syntax rather than the PGI Accelerator model, so that your code can be compiled by a wider range of compilers.

As for performance, I’d try either reordering the loops or using “loop” directives so that the “i” dimension gets the “vector” loop; since “i” is the stride-1 dimension in Fortran’s column-major layout, memory accesses will then be coalesced across threads. For example:

! -- loop through the full 3^N stencil 
!$acc data copyin(a(1:n,1:n,1:n)) copyout(r(1:n,1:n,1:n)) 
!$acc kernels loop 
    do k = 2,n-1 
    do j = 2,n-1 
    do i = 2,n-1 
       r(i,j,k) = ...   ! same 27-point stencil body as in your original
    end do
    end do
    end do
!$acc end data

or

! -- loop through the full 3^N stencil 
!$acc data copyin(a(1:n,1:n,1:n)) copyout(r(1:n,1:n,1:n)) 
!$acc kernels loop vector    
    do i = 2,n-1 
!$acc loop gang vector
    do j = 2,n-1 
!$acc loop gang
    do k = 2,n-1        
       r(i,j,k) = ...   ! same stencil body as above
    end do
    end do
    end do
!$acc end data
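
Putting the first variant together with your full stencil, the whole routine would look something like this (untested, but it should be a drop-in replacement):

 subroutine filtergpu(r, a, n)
    real,dimension(:,:,:) :: a,r
    integer :: i,j,k,n
!
! -- k outermost, stride-1 "i" innermost so it maps to the vector lanes
!$acc data copyin(a(1:n,1:n,1:n)) copyout(r(1:n,1:n,1:n))
!$acc kernels loop
    do k = 2,n-1
    do j = 2,n-1
    do i = 2,n-1
       r(i,j,k) =     0.125*(a(i,j,k))&
                 +   0.0625*(a(i-1,j,k)+a(i+1,j,k)+&
                             a(i,j-1,k)+a(i,j+1,k)+&
                             a(i,j,k-1)+a(i,j,k+1))&
                 +  0.03125*(a(i,j+1,k+1)+a(i,j+1,k-1)+&
                             a(i,j-1,k+1)+a(i,j-1,k-1)+&
                             a(i+1,j,k+1)+a(i+1,j,k-1)+&
                             a(i-1,j,k+1)+a(i-1,j,k-1)+&
                             a(i+1,j+1,k)+a(i+1,j-1,k)+&
                             a(i-1,j+1,k)+a(i-1,j-1,k))&
                 + 0.015625*(a(i+1,j+1,k+1)+a(i+1,j+1,k-1)+&
                             a(i+1,j-1,k+1)+a(i+1,j-1,k-1)+&
                             a(i-1,j+1,k+1)+a(i-1,j+1,k-1)+&
                             a(i-1,j-1,k+1)+a(i-1,j-1,k-1))
    end do
    end do
    end do
!$acc end data
 end subroutine
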
- Mat

Thank you for that - it has given me a small performance boost, and the code is now performing at the expected speed-up compared to the serial CPU code.

Rather than posting a new question, can I just ask if there is a compiler flag to switch between real(4) and real(8)? (I have everything declared as plain ‘real’ in the code.) Comparing the results of my CPU and GPU implementations shows some differences, and I am interested to see the performance in double precision, but I would rather not go and rewrite a lot of declarations.

Thanks, Harry

can I just ask if there is a compiler flag to change between real(4) and real(8)?

Yes, add “-r8” to change the default real kind from real(4) to real(8).
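
For example, to build the OpenACC version in double precision (source file name assumed):

pgfortran -r8 -acc -Minfo=accel filter.f90 -o filter

The -Minfo=accel output is also worth checking: it tells you how each loop was scheduled on the device.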

Ah, thanks, I probably should have been able to guess that! :)