Hi,

I have written the following subroutine for 3D median filtering using OpenACC and Fortran.

```
SUBROUTINE medianFiltReal_3D_acc(a, n)
  IMPLICIT NONE
  REAL(KIND = dp), INTENT(INOUT), DIMENSION(:, :, :) :: a
  INTEGER, INTENT(IN) :: n
  REAL(KIND = dp), ALLOCATABLE, DIMENSION(:, :, :) :: aCopy
  REAL(KIND = dp), ALLOCATABLE, DIMENSION(:) :: a_kernel
  INTEGER :: nHalf = 0, i1 = 0, i2 = 0, i3 = 0, n1 = 0, n2 = 0, n3 = 0, ii1=0,ii2=0,ii3=0
  ALLOCATE(aCopy(SIZE(a, 1), SIZE(a, 2), SIZE(a, 3)), a_kernel(n**3))
  aCopy = a
  nHalf = NINT(REAL(n - 1) / 2.0_dp)
  n1 = SIZE(a, 1)
  n2 = SIZE(a, 2)
  n3 = SIZE(a, 3)
  !$acc data copyin(a,nHalf,n,n1,n2,n3)
  !$acc data copyout(aCopy)
  !$acc data create(a_kernel)
  !$acc parallel loop gang collapse(3)
  DO i3 = 1 + nHalf, n3 - nHalf
    DO i2 = 1 + nHalf, n2 - nHalf
      DO i1 = 1 + nHalf, n1 - nHalf
        !a_kernel = PACK(a(i1-nHalf:i1+nHalf,i2-nHalf:i2+nHalf,i3-nHalf:i3+nHalf), MASK=.TRUE.)
        !$acc loop vector collapse(3)
        DO ii3 = -nHalf, nHalf
          DO ii2 = -nHalf, nHalf
            DO ii1 = -nHalf, nHalf
              a_kernel(1 + (ii1 + nHalf) + (ii2 + nHalf) * n + (ii3 + nHalf) * n * n) = a(i1 + ii1, i2 + ii2, i3 + ii3)
            END DO
          END DO
        END DO
        aCopy(i1, i2, i3) = medianReal(a_kernel)
      END DO
    END DO
  END DO
  !$acc end parallel
  !$acc end data
  !$acc end data
  !$acc end data
  a = aCopy
  DEALLOCATE(aCopy)
END SUBROUTINE medianFiltReal_3D_acc
```

The function `medianReal` returns the median of a 1D array. It employs a recursive quicksort algorithm.
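The body of `medianReal` is not shown in this post. As a rough illustration only, a median routine built on a recursive quicksort might look like the sketch below. The module name `median_sketch_mod`, the helper `quickSortReal`, and the definition of `dp` as double precision are all assumptions made for the example, not the actual code.

```fortran
MODULE median_sketch_mod
  IMPLICIT NONE
  ! Assumption: dp denotes double precision, as the filter code suggests.
  INTEGER, PARAMETER :: dp = KIND(1.0D0)
CONTAINS

  ! Hypothetical helper: sorts x(lo:hi) in place with a recursive quicksort.
  RECURSIVE SUBROUTINE quickSortReal(x, lo, hi)
    REAL(KIND = dp), INTENT(INOUT) :: x(:)
    INTEGER, INTENT(IN) :: lo, hi
    REAL(KIND = dp) :: pivot, tmp
    INTEGER :: i, j
    IF (lo >= hi) RETURN
    pivot = x((lo + hi) / 2)
    i = lo
    j = hi
    DO WHILE (i <= j)
      DO WHILE (x(i) < pivot)
        i = i + 1
      END DO
      DO WHILE (x(j) > pivot)
        j = j - 1
      END DO
      IF (i <= j) THEN
        tmp = x(i)
        x(i) = x(j)
        x(j) = tmp
        i = i + 1
        j = j - 1
      END IF
    END DO
    ! The recursion here is what prevents a static stack-size estimate.
    CALL quickSortReal(x, lo, j)
    CALL quickSortReal(x, i, hi)
  END SUBROUTINE quickSortReal

  ! Returns the median of a non-empty 1D array without modifying it.
  FUNCTION medianReal(x) RESULT(med)
    REAL(KIND = dp), INTENT(IN) :: x(:)
    REAL(KIND = dp) :: med
    REAL(KIND = dp) :: work(SIZE(x))  ! local copy that gets sorted
    INTEGER :: m
    m = SIZE(x)
    work = x
    CALL quickSortReal(work, 1, m)
    IF (MOD(m, 2) == 1) THEN
      med = work((m + 1) / 2)
    ELSE
      med = 0.5_dp * (work(m / 2) + work(m / 2 + 1))
    END IF
  END FUNCTION medianReal

END MODULE median_sketch_mod
```

On the host this would be called as `med = medianReal(a_kernel)`. To be callable from inside the compute region, both routines would additionally need an `!$acc routine seq` directive.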

- Compilation yields the following warning:

  `nvlink warning : Stack size for entry function 'myutils_mod_medianfiltreal_3d_acc_6260_gpu' cannot be statically determined`

  On execution, the program fails with the error `FATAL ERROR: ARRAY IS NOT ALLOCATED`. A little online searching suggests that these issues occur when recursive functions are used in device code. Is there a way to move ahead without modifying my sorting functions? I might need to increase the stack size, but I am unsure how to do it.

- In addition, the three inner DO loops are there in lieu of the Fortran `PACK` intrinsic, which the compiler does not recognize inside a compute region. Other Fortran intrinsics such as `MINLOC` also don't seem to work on the device. How can I get them working on the device?

I would appreciate any help in debugging the above subroutine and in speeding it up.

Cheers,

Jyoti