3D Median Filter Fortran+OPENACC


I have written the following subroutine for 3D median filtering using OPENACC and Fortran.

    SUBROUTINE medianFiltReal_3D_acc(a, n)
        REAL(KIND = dp), INTENT(INOUT), DIMENSION(:, :, :) :: a
        INTEGER, INTENT(IN) :: n

        REAL(KIND = dp), ALLOCATABLE, DIMENSION(:, :, :) :: aCopy
        REAL(KIND = dp), ALLOCATABLE, DIMENSION(:) :: a_kernel
        INTEGER :: nHalf = 0, i1 = 0, i2 = 0, i3 = 0, n1 = 0, n2 = 0, n3 = 0, ii1=0,ii2=0,ii3=0

        ALLOCATE(aCopy(SIZE(a, 1), SIZE(a, 2), SIZE(a, 3)), a_kernel(n**3))
        aCopy = a
        nHalf = NINT(REAL(n - 1) / 2.0_dp)
        n1 = SIZE(a, 1)
        n2 = SIZE(a, 2)
        n3 = SIZE(a, 3)
        !$acc data copyin(a,nHalf,n,n1,n2,n3)
        !$acc data copyout(aCopy)
        !$acc data create(a_kernel)
        !$acc parallel loop gang collapse(3)
        DO i3 = 1 + nHalf, n3 - nHalf
            DO i2 = 1 + nHalf, n2 - nHalf
                DO i1 = 1 + nHalf, n1 - nHalf
                    !a_kernel = PACK(a(i1-nHalf:i1+nHalf,i2-nHalf:i2+nHalf,i3-nHalf:i3+nHalf), MASK=.TRUE.)
                    !$acc loop vector collapse(3)
                    DO ii3 = -nHalf, nHalf
                        DO ii2 = -nHalf, nHalf
                            DO ii1 = -nHalf, nHalf
                                a_kernel(1 + (ii1 + nHalf) + (ii2 + nHalf) * n + (ii3 + nHalf) * n * n) = a(i1 + ii1, i2 + ii2, i3 + ii3)
                            END DO
                        END DO
                    END DO
                    aCopy(i1, i2, i3) = medianReal(a_kernel)
                END DO
            END DO
        END DO
        !$acc end parallel
        !$acc end data
        !$acc end data
        !$acc end data
        a = aCopy
    END SUBROUTINE medianFiltReal_3D_acc

Subroutine medianReal returns the median of a 1D array. It employs a recursive quicksort algorithm.

  1. Compilation yields the following warning:

nvlink warning : Stack size for entry function 'myutils_mod_medianfiltreal_3d_acc_6260_gpu' cannot be statically determined

On execution the program fails with the error FATAL ERROR: ARRAY IS NOT ALLOCATED. A little online searching reveals that these issues occur when using recursive functions. Is there a way to move ahead without modifying my sorting routines? I might need to increase the stack size, but I am unsure how to do that.

  2. In addition, the 3 inner DO loops are in lieu of the PACK Fortran function. The PACK function is not recognized by the compiler when inside a compute region. Other Fortran functions like MINLOC also don’t seem to work on the device. How can I get them working on the device?

I would appreciate any help in debugging the above subroutine and also in speeding it up.


> The PACK function is not recognized by the compiler when inside a compute region. Other Fortran functions like MINLOC also don’t seem to work on the device. How can I get them working on the device?

In order to support Fortran intrinsics on the device, we need to inline them. Unfortunately, not every intrinsic can be inlined, and PACK is one that can’t.

For MINLOC, try adding “use cudafor” to pull in the CUDA Fortran version. I have an open issue report on getting it to work with pure OpenACC, but engineering hasn’t implemented it yet.

Recursion in general is problematic given how small the device stack is, but as long as the call depth isn’t too big it should be OK. You can also increase the stack size via the environment variable “NV_ACC_CUDA_STACKSIZE”. However, there is an upper limit (the exact limit depends on your device and CUDA version), so you still can’t go too far.

I presume the recursive function is “medianReal” and that it’s not a generic for “medianFiltReal_3D_acc”? The latter would be a problem, since a device routine can’t contain data or compute regions.

Does “medianReal” do allocation? If so, this could be the cause of the allocation error. The default heap on the device is quite small as well, but you can increase it via the environment variable “NV_ACC_CUDA_HEAPSIZE”. Here there’s no upper limit other than the amount of available memory.

Though in general we don’t recommend allocating on the device. Besides the heap-size limit, allocation is quite slow. If possible, you’ll want to use fixed-size arrays, or pass a private array into the routine.
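For example, here’s a sketch of the fixed-size variant (hypothetical: “NMAX” would be a compile-time upper bound you choose for n**3; this is not your exact code):

    ! Sketch: fixed-size scratch buffer, so no device-side ALLOCATE is needed.
    ! NMAX is a hypothetical compile-time upper bound on n**3.
    REAL(KIND = dp) :: a_kernel(NMAX)
    ...
    ! Privatize the buffer so each gang gets its own copy; a single shared
    ! "create(a_kernel)" array would be racy once multiple gangs write to it.
    !$acc parallel loop gang collapse(3) private(a_kernel)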

The call to “medianReal” is being made from the gang level, meaning only one thread will actively be calling it. Unless it’s a “routine vector” and you have vector loops inside it?

Can you provide a complete reproducing example? I’m having to speculate about what “medianReal” does, which makes it difficult to give good advice.

Thanks for the clarifications.
Yes, medianReal and the quicksort algorithm QsortC do allocations.

    FUNCTION medianReal(a) RESULT(med)
        !$acc routine seq
        REAL(KIND = dp), INTENT(IN), DIMENSION(:) :: a

        REAL(KIND = dp) :: med
        INTEGER :: l
        REAL(KIND = dp), DIMENSION(SIZE(a, 1)) :: ac

        ac = a
        CALL QsortC(ac)

        l = SIZE(a, 1)
        IF (MOD(l, 2) == 0) THEN
            med = (ac(l / 2 + 1) + ac(l / 2)) / 2.0_dp
        ELSE
            med = ac(l / 2 + 1)
        END IF
    END FUNCTION medianReal
    SUBROUTINE QsortC(A, ind)
        REAL(KIND = dp), INTENT(INOUT), DIMENSION(:) :: A
        INTEGER, INTENT(OUT), DIMENSION(:), OPTIONAL :: ind

        INTEGER :: i
        REAL(KIND = dp), DIMENSION(:, :), ALLOCATABLE :: Acopy

        ALLOCATE(Acopy(SIZE(A), 2))
        Acopy(:, 1) = A
        Acopy(:, 2) = [(REAL(i, KIND = dp), i = 1, SIZE(A))]

        CALL Qsort(Acopy)

        A = Acopy(:, 1)
        IF (PRESENT(ind)) ind = INT(Acopy(:, 2))
    END SUBROUTINE QsortC

Recursive subroutine Qsort does not perform any allocations.

For automatics like “ac”, the message should contain the string “AUTO ARRAY”. Without “AUTO”, it implies the error is coming from an explicit allocation. Does QsortC use “allocate”, or is it using automatics as well?

If it’s using “allocate”, then it’s likely the device has run out of heap, which you should try increasing using NV_ACC_CUDA_HEAPSIZE (or alternatively by calling cudaDeviceSetLimit).
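If you’d rather set it from code, here’s a sketch of the API route (assuming CUDA Fortran, i.e. “use cudafor”; the 64 MB value is just an example, not a recommendation):

    USE cudafor
    INTEGER :: istat
    ! Raise the device malloc heap to 64 MB.
    ! This must be done before the first kernel launch.
    istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64_8 * 1024 * 1024)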

If it’s using automatics, then the error might be coming from someplace else. Though I’d still try increasing the heap.

For performance, I’d try moving the “vector” to the outer loops (i.e., “acc parallel loop gang vector collapse(3)”) and running the inner “PACK” loops sequentially. You’ll then have multiple threads calling “medianReal”. The caveat is that the required heap size will increase proportionally.
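Applied to your loop nest, the restructured version would look something like this sketch (note the added “private(a_kernel)”, so each thread gets its own scratch buffer):

    !$acc parallel loop gang vector collapse(3) private(a_kernel)
    DO i3 = 1 + nHalf, n3 - nHalf
        DO i2 = 1 + nHalf, n2 - nHalf
            DO i1 = 1 + nHalf, n1 - nHalf
                ! The "PACK" replacement loops now run sequentially per thread.
                DO ii3 = -nHalf, nHalf
                    DO ii2 = -nHalf, nHalf
                        DO ii1 = -nHalf, nHalf
                            a_kernel(1 + (ii1 + nHalf) + (ii2 + nHalf) * n + (ii3 + nHalf) * n * n) = a(i1 + ii1, i2 + ii2, i3 + ii3)
                        END DO
                    END DO
                END DO
                aCopy(i1, i2, i3) = medianReal(a_kernel)
            END DO
        END DO
    END DO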

Thanks. As always, I appreciate your timely and thorough response.