Error in passing arguments using hybrid OpenACC and MPI

Hi, I would like to ask for help with an error that occurs when passing arguments to a subroutine using the hybrid OpenACC and MPI programming model. I have simplified the program to this:
module gulemath
contains
SUBROUTINE write_int(m, devicenum)
!$acc routine seq
implicit none
INTEGER(kind=4) :: m, devicenum

    write(*,*) 'm and device_num', m, devicenum

end subroutine write_int

end module gulemath

program main
use gulemath
use mpi
use openacc

implicit none

integer(kind=4) :: i, n_m, nmax, n_north, prd
integer(kind=4) :: num_device, idx_device
character(len=40) :: filedgcombine, filecoef1, prc_name
integer(kind=4) :: ierr, myid, numprocs, rc

call MPI_INIT_THREAD(MPI_THREAD_MULTIPLE, prd, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

nmax = 360
n_north = 180

num_device = acc_get_num_devices(acc_device_nvidia)

if (myid /= 0) then
    idx_device = mod(myid, num_device)
    call acc_init(acc_device_nvidia)
    call acc_set_device_num(idx_device, acc_device_nvidia)
    !$acc kernels
    !$acc loop independent
    do i = myid, n_north, numprocs-1   ! loop 1
        call write_int(i, idx_device)
        n_m = i
        call write_int(n_m, idx_device)
    end do   ! loop i
    !$acc end kernels
    write(*,*) 'after loop 1'
end if

if (myid /= 0) then
    write(*,*) 'my id', myid, numprocs
    idx_device = mod(myid, num_device)
    call acc_set_device_num(idx_device, acc_device_nvidia)
    !$acc parallel
    !$acc loop independent
    do i = myid-1, nmax-1, numprocs-1   ! loop 2
        write(*,*) i
        call write_int(i, idx_device)
        n_m = i
        call write_int(n_m, idx_device)
    end do   ! loop i
    !$acc end parallel
end if

call MPI_FINALIZE(rc)

end ! the main program

You can see that there are two loops that could be executed in parallel. In each loop, I would like to pass the loop variable to the subroutine and print it out. The variable is passed correctly in the first loop; however, it is not passed correctly in the second loop.

I compile the program with this command line:
mpif90 -acc -gpu=cc70 -gpu=cuda11.0 -Minfo -Mlarge_arrays printm.f90 -o printm

and run it like this:
srun -p sgpu mpirun -np 2 --oversubscribe ./printm

Could you please tell me what is wrong?
Many thanks!

Here are the compiler feedback messages:

% mpif90 test.f90 -acc -Minfo=accel
write_int:
      3, Generating acc routine seq
         Generating Tesla code
main:
     39, Generating implicit copy(n_m,idx_device) [if not already present]
     41, Loop is parallelizable
         Generating Tesla code
         41, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     54, Generating Tesla code
         56, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

Notice the implicit copy of “n_m”. While scalars are privatized by default, here you’re passing “n_m” by reference to the subroutine (Fortran defaults to passing by reference). In this case, the compiler must assume that the reference to n_m may be kept by another variable, and therefore it needs to make the variable shared.

Here, the solution is to add the “value” attribute to the arguments so they are no longer passed by reference.

module gulemath
contains
SUBROUTINE write_int(m, devicenum)
!$acc routine seq
implicit none
INTEGER(kind=4), value :: m, devicenum   ! "value": the dummies are local copies, so no reference is generated
!INTEGER(kind=4) :: m, devicenum         ! original by-reference declaration

    write(*,*) 'm and device_num', m, devicenum

end subroutine write_int
end module gulemath
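With “value”, each invocation of write_int works on its own copy of the arguments, so the compiler no longer needs to generate the implicit shared copy of “n_m” shown in the feedback above.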

Hope this helps,
Mat

Hi, Mat,

Many thanks, the solution works now. It helps me a lot in my work.

So, you mean that in the parallel region, once a scalar variable is passed to a subroutine, the variable is made shared and passed by reference. Different threads then read from and write to the same address and conflict with each other, which is why I couldn’t print the true value. Is my understanding right?

However, if this understanding is right, another question arises: why can the threads read and write the shared address in the main program, before calling the subroutine, without any error, yet conflict inside the subroutine?

Moreover, I am still confused by the strange behavior of the different scalar variables in the program. To demonstrate the problem, I will stick to the original program, in which the variables are passed by reference. I post it here:

module gulemath
contains
SUBROUTINE write_int(m, devicenum)
!$acc routine seq
implicit none
INTEGER(kind=4) :: m, devicenum

    write(*,*) 'm and device_num', m, devicenum

end subroutine write_int

end module gulemath

program main
use gulemath
use mpi
use openacc

implicit none

integer(kind=4) :: i, n_m, nmax, n_north, prd
integer(kind=4) :: num_device, idx_device
character(len=40) :: filedgcombine, filecoef1, prc_name
integer(kind=4) :: ierr, myid, numprocs, rc

call MPI_INIT_THREAD(MPI_THREAD_MULTIPLE, prd, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

nmax = 360
n_north = 180

num_device = acc_get_num_devices(acc_device_nvidia)

if (myid /= 0) then
    idx_device = mod(myid, num_device)
    call acc_init(acc_device_nvidia)
    call acc_set_device_num(idx_device, acc_device_nvidia)
    !$acc kernels
    !$acc loop independent
    do i = myid, n_north, numprocs-1   ! loop 1
        call write_int(i, idx_device)
        n_m = i
        call write_int(n_m, idx_device)
    end do   ! loop i
    !$acc end kernels
    write(*,*) 'after loop 1'
end if

n_m = 0
if (myid /= 0) then
    write(*,*) 'my id', myid, numprocs
    idx_device = mod(myid, num_device)
    call acc_set_device_num(idx_device, acc_device_nvidia)
    !$acc parallel
    !$acc loop independent
    do i = myid-1, nmax-1, numprocs-1   ! loop 2
        write(*,*) i
        call write_int(i, idx_device)
        n_m = i
        write(*,*) 'mmmmm', n_m
        call write_int(n_m, idx_device)
        call write_int(n_north, idx_device)
    end do   ! loop i
    !$acc end parallel
end if

call MPI_FINALIZE(rc)

end ! the main program

At first, I am not sure whether the variables behave differently in the first loop (‘loop 1’) and the second loop (‘loop 2’). In the first loop I can print the values correctly; in the second loop, however, I either cannot print the correct value or cannot print anything at all.

Secondly, you can see that there are three scalar variables in the second loop: ‘i’, ‘n_m’, and ‘n_north’. Why do these three behave differently? For ‘i’, I can print it in the subroutine, but the value is wrong; for ‘n_m’, I cannot print the value at all, only null; yet I can print the correct value of ‘n_north’ in the subroutine. Could you please tell me the reason behind this?

At last, would the employment of MPI here cause problems for the program?

Many thanks!

Digging a bit more, there does seem to be a code generation issue here, where the assignment “n_m=i” is getting optimized away. The compiler correctly substitutes the value for the reference in the write statement, but not in the write_int call.

I still highly recommend using “value” for the reason given before, but it also happens to work around this issue, since the compiler no longer needs to generate a reference and can instead pass the variable’s value directly.
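As a side note, if a by-reference interface ever needs to be kept, a possible alternative is to privatize “n_m” explicitly on the loop so each thread has its own copy for the reference to point at. This is only a sketch against your reproducer, and I have not verified it against the code-generation issue above, so “value” remains the recommended fix:

!$acc parallel
!$acc loop independent private(n_m)    ! each thread gets its own private n_m
do i = myid-1, nmax-1, numprocs-1
    call write_int(i, idx_device)
    n_m = i                            ! assignment targets the private copy
    call write_int(n_m, idx_device)    ! the reference now points at the private copy
end do
!$acc end parallel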

At last, would the employment of MPI here cause problems for the program?

No, MPI is just a set of host API calls, so it has no effect on the generated OpenACC code.
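For reference, the only interaction between the two models is the host-side device selection, which your code already does along these lines:

! Host code only: each MPI rank binds itself to one GPU before any compute region
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
num_device = acc_get_num_devices(acc_device_nvidia)    ! GPUs visible to this rank
idx_device = mod(myid, num_device)                     ! round-robin rank-to-GPU mapping
call acc_set_device_num(idx_device, acc_device_nvidia)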

-Mat

OK, I have got it. I have changed the related subroutines so that the parameters are passed by value.

Many thanks!