Segmentation fault while passing variables from different modules to kernel

I’m new to CUDA Fortran. Can someone please explain what is wrong with the following code?
Why am I getting a segmentation fault?

attributes(global) subroutine increment(a)
    implicit none
    real(kind=8),device::a(:)
    integer :: i
    i = threadIdx%x
    a(i)=a(i)+1
    if(a(i)/=2)then
        write(*,*)' Program Passed in Device'
    endif
end subroutine increment

module common
    implicit none
    real(kind=8):: a(2)
    real(kind=8),device:: a_d(2)
end module common
    
program incrementTestGPU
    use cudafor
    use common
    implicit none
    a=1
    a_d=a
    call prac    
end program incrementTestGPU

subroutine prac
    use common
    use cudafor
    call increment<<<1,2>>>(a_d)
    a=a_d
    if(any(a/=2))then
        write(*,*)' **** Program Failed **** '
    else
        write(*,*)' **** Program Passed **** '
    endif
end subroutine prac

One user on Stack Overflow commented:
increment is an external subroutine (not in the module), but has an assumed-shape dummy argument. Did you mean for it to have an assumed-size argument (a(*) instead of a(:))? I’m not entirely convinced that CUDA Fortran plays well with assumed-shape dummies, but to give it a reasonable chance you’ll likely need to give increment an explicit interface in prac.

I don’t understand what he is saying. I tried a(*), a(:), and a(2) as dummy arguments, and the program fails every time. What does he mean by giving increment an explicit interface in prac? The program works fine if I change the kernel to a normal subroutine that increments the array in a loop. How is calling a kernel subroutine different from calling a normal subroutine?


In Fortran, a subroutine called without an explicit interface uses F77-style calling conventions, which basically just push the arguments on the stack and do no argument checking. For an array, only the raw pointer is passed, not the descriptor. So while this may “work” in your simple host example, it’s not really correct given that the dummy argument is an assumed-shape array. The example would most likely fail if you tried to use any intrinsic that needs the array descriptor, such as size.
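To make the descriptor point concrete, here is a minimal host-only sketch in standard Fortran (no CUDA involved). The names host_ops, increment_host, and descriptor_demo are made up for illustration. Because the subroutine is contained in a module, the caller gets an explicit interface automatically, so the assumed-shape dummy x(:) receives a descriptor and size(x) is well defined.

```fortran
module host_ops
    implicit none
contains
    ! Hypothetical host routine: the assumed-shape dummy x(:) needs an
    ! array descriptor, which the caller only passes when it can see
    ! an explicit interface (here provided by the module).
    subroutine increment_host(x)
        real(kind=8), intent(inout) :: x(:)
        integer :: i
        do i = 1, size(x)      ! size() reads the extent from the descriptor
            x(i) = x(i) + 1
        end do
    end subroutine increment_host
end module host_ops

program descriptor_demo
    use host_ops               ! makes the explicit interface visible
    implicit none
    real(kind=8) :: a(2)
    a = 1
    call increment_host(a)
    write(*,*) a               ! both elements should now be 2
end program descriptor_demo
```

If increment_host were instead an external subroutine with no interface block at the call site, the call would pass only the address of a, and the result of size(x) inside the routine would be undefined.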

CUDA Fortran does require explicit interfaces so that the correct data is passed. Besides the descriptor, things like the ‘device’ or ‘value’ attributes are only exposed via an interface.
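One common way to get that explicit interface is to move the kernel into a module. The following is a sketch of your program rearranged that way, not a tested build: the dummy argument is renamed to x (my choice, only to avoid confusion with the module variable a), and the kernel is placed in your existing common module, which is an assumption about how you would want to organize it.

```fortran
module common
    use cudafor
    implicit none
    real(kind=8) :: a(2)
    real(kind=8), device :: a_d(2)
contains
    ! The kernel is a module procedure, so any scope that uses this
    ! module sees its explicit interface, including the device
    ! attribute and the assumed-shape descriptor for x.
    attributes(global) subroutine increment(x)
        real(kind=8), device :: x(:)
        integer :: i
        i = threadIdx%x
        x(i) = x(i) + 1
    end subroutine increment
end module common

subroutine prac
    use cudafor
    use common
    implicit none
    call increment<<<1,2>>>(a_d)   ! one block of two threads
    a = a_d                        ! device-to-host copy
    if (any(a /= 2)) then
        write(*,*) ' **** Program Failed **** '
    else
        write(*,*) ' **** Program Passed **** '
    end if
end subroutine prac

program incrementTestGPU
    use cudafor
    use common
    implicit none
    a = 1
    a_d = a                        ! host-to-device copy
    call prac
end program incrementTestGPU
```

With the NVIDIA HPC SDK this would typically be built with something like nvfortran -cuda. If you would rather keep increment as an external subroutine, the alternative is to write an interface block for it inside prac that repeats the attributes(global) prefix and the device declaration of the dummy argument.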

Looking at your SO post (Segmentation fault while passing variables from different modules to kernel in CUDA Fortran - Stack Overflow), it seems you were able to get past this issue. I’ll post an answer to your second SO post.

Thank you so much @MatColgrove