CUDA Fortran device kernels require interfaces in order to be called from host code. Hence, you either write an explicit interface for S in your main program or put S in a module where an implicit interface is generated.
module foo
contains
attributes(global) subroutine S
end subroutine S
end module foo
program P
use foo
call S<<<1, 1>>>()
end program P