Dear experts,
I am using MPI_Allreduce with a device buffer. Profiling results show that the time spent in this call keeps increasing as the simulation progresses, but it should not do so.
real(8), allocatable, device :: PET_balance_de(:)
allocate(PET_balance_de(2))
! in-place sum of the 2-element device buffer across all ranks
call MPI_Allreduce(MPI_IN_PLACE, PET_balance_de, 2, MPI_DOUBLE_PRECISION, &
                   MPI_SUM, MPI_COMM_WORLD, ierr)
PET_balance_de is a two-element array whose size does not depend on the timestep, so the cost of this call should not grow with simulation time. However, as the profiling results show, it does.
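For reference, here is a minimal sketch of how I could time the call in isolation at each timestep (assuming a CUDA-aware MPI build; the program name, the myrank variable, and the step loop are only illustrative). The cudaDeviceSynchronize before starting the timer drains any pending GPU work so that it is not attributed to the Allreduce itself:

program allreduce_timing
  use mpi
  use cudafor
  implicit none
  real(8), allocatable, device :: PET_balance_de(:)
  real(8) :: t0, t1
  integer :: ierr, istat, myrank, step

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  allocate(PET_balance_de(2))
  PET_balance_de = 1.0d0

  do step = 1, 100
     ! ... per-timestep GPU work would go here ...

     ! finish all queued GPU work so it is not charged to the Allreduce
     istat = cudaDeviceSynchronize()

     t0 = MPI_Wtime()
     call MPI_Allreduce(MPI_IN_PLACE, PET_balance_de, 2, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)
     t1 = MPI_Wtime()

     if (myrank == 0) print *, 'step', step, ' Allreduce time (s):', t1 - t0
  end do

  deallocate(PET_balance_de)
  call MPI_Finalize(ierr)
end program allreduce_timing

If the per-step time measured this way stays flat while the profiler still shows growth, the extra time would be coming from work queued before the call rather than from the reduction itself.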
This seems strange. Could you please give me some ideas about what might be causing it?
Thanks much in advance!
Chen