Hi, I am running a large MPI+OpenACC parallel multi-GPU program written in Fortran on 4 A100 (40 GB) GPUs on an NVIDIA DGX cluster. The inter-GPU communication is small and should account for only 5-10% of the total time per iteration, but for some reason it is consuming more than 50% of the time per iteration, which makes the overall efficiency of my code in multi-GPU runs very poor.
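For context, each MPI rank drives one GPU; the device selection at startup is roughly the following (a simplified sketch, the subroutine and variable names here are illustrative and differ in my real code):

! Simplified sketch of the per-rank GPU binding (names are illustrative)
subroutine bind_rank_to_gpu(rank)
  use openacc
  implicit none
  integer, intent(in) :: rank
  integer :: ngpus
  ! one MPI rank per GPU: rank 0 -> GPU 0, rank 1 -> GPU 1, ...
  ngpus = acc_get_num_devices(acc_device_nvidia)
  call acc_set_device_num(mod(rank, ngpus), acc_device_nvidia)
end subroutine bind_rank_to_gpu

My communication routine is quite basic and looks as follows: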
SUBROUTINE MPI_INTERFACE_on_TOP(block, Home_Block, Neigh_Block, home_block_rank, neigh_block_rank, comm_tag)
  use global_variables
  use mpi_variables
  use mpi
  implicit none
  integer :: home_block_rank, neigh_block_rank
  integer :: block, Home_Block, Neigh_Block
  double precision :: Qp_packed(NK(block)*NGh*NI(block)*nprims)
  integer :: SPt_I, EPt_I, SPt_J, EPt_J, SPt_K, EPt_K, from
  integer :: package_size, comm_tag
  integer :: status(MPI_STATUS_SIZE)

  package_size = NK(block)*NGh*NI(block)*nprims

  IF (rank == neigh_block_rank) THEN
    ! pack from neigh_block
    nbl = Neigh_Block
    SPt_J = 1; EPt_J = NGh
    SPt_K = 1; EPt_K = NK(nbl)
    SPt_I = 1; EPt_I = NI(nbl)
    call PACK_DATA(Qp, Qp_packed, nbl, SPt_I, EPt_I, SPt_J, EPt_J, SPt_K, EPt_K, NI, NJ, NK, nblocks, nprims, NGh)

    ! send
    dest_rank = home_block_rank; tag = comm_tag
#ifdef CUDA_AWARE_MPI
    !$acc host_data use_device(Qp_packed)
#else
    !$acc update self(Qp_packed)
#endif
    call MPI_SEND(Qp_packed, package_size, MPI_DOUBLE_PRECISION, dest_rank, tag, MPI_COMM_WORLD, ierr)
#ifdef CUDA_AWARE_MPI
    !$acc end host_data
#endif

  ELSEIF (rank == home_block_rank) THEN
    ! receive
    from = neigh_block_rank; tag = comm_tag
#ifdef CUDA_AWARE_MPI
    !$acc host_data use_device(Qp_packed)
#endif
    call MPI_RECV(Qp_packed, package_size, MPI_DOUBLE_PRECISION, from, tag, MPI_COMM_WORLD, status, ierr)
#ifdef CUDA_AWARE_MPI
    !$acc end host_data
#else
    !$acc update device(Qp_packed)
#endif

    ! unpack into home block
    nbl = Home_Block
    SPt_J = NJ(nbl)+1; EPt_J = NJ(nbl)+NGh
    SPt_K = 1; EPt_K = NK(nbl)
    SPt_I = 1; EPt_I = NI(nbl)
    call UNPACK_DATA(Qp, Qp_packed, nbl, SPt_I, EPt_I, SPt_J, EPt_J, SPt_K, EPt_K, NI, NJ, NK, nblocks, nprims, NGh)
  ENDIF
END SUBROUTINE MPI_INTERFACE_on_TOP

SUBROUTINE PACK_DATA(Qp, Qp_packed, nbl, SPt_I, EPt_I, SPt_J, EPt_J, SPt_K, EPt_K, NI, NJ, NK, nblocks, nprims, NGh)
  implicit none
  integer :: SPt_I, EPt_I, SPt_J, EPt_J, SPt_K, EPt_K, nblocks, nprims, NGh
  integer, dimension(nblocks) :: NI, NJ, NK
  integer :: i, j, k, nbl, npr
  double precision, dimension(-NGh+1:MAXVAL(NI)+NGh, -NGh+1:MAXVAL(NJ)+NGh, -NGh+1:MAXVAL(NK)+NGh, nblocks, nprims) :: Qp
  double precision :: Qp_packed((EPt_I-SPt_I+1)*(EPt_J-SPt_J+1)*(EPt_K-SPt_K+1)*nprims)

  ! Flatten the requested (i,j,k,npr) slab of block nbl into the contiguous 1-D buffer (runs on the device)
  !$acc parallel loop gang vector collapse(4) default(present)
  DO npr = 1, nprims
    DO k = SPt_K, EPt_K
      DO j = SPt_J, EPt_J
        DO i = SPt_I, EPt_I
          Qp_packed(1 &
            + (i-SPt_I) &
            + (EPt_I-SPt_I+1)*(j-SPt_J) &
            + (EPt_I-SPt_I+1)*(EPt_J-SPt_J+1)*(k-SPt_K) &
            + (EPt_I-SPt_I+1)*(EPt_J-SPt_J+1)*(EPt_K-SPt_K+1)*(npr-1)) = Qp(i,j,k,nbl,npr)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
END SUBROUTINE PACK_DATA

SUBROUTINE UNPACK_DATA(Qp, Qp_packed, nbl, SPt_I, EPt_I, SPt_J, EPt_J, SPt_K, EPt_K, NI, NJ, NK, nblocks, nprims, NGh)
  implicit none
  integer :: SPt_I, EPt_I, SPt_J, EPt_J, SPt_K, EPt_K, nblocks, nprims, NGh
  integer, dimension(nblocks) :: NI, NJ, NK
  integer :: i, j, k, nbl, npr
  double precision, dimension(-NGh+1:MAXVAL(NI)+NGh, -NGh+1:MAXVAL(NJ)+NGh, -NGh+1:MAXVAL(NK)+NGh, nblocks, nprims) :: Qp
  double precision :: Qp_packed((EPt_I-SPt_I+1)*(EPt_J-SPt_J+1)*(EPt_K-SPt_K+1)*nprims)

  ! Scatter the contiguous 1-D buffer back into the requested (i,j,k,npr) slab of block nbl (runs on the device)
  !$acc parallel loop gang vector collapse(4) default(present)
  DO npr = 1, nprims
    DO k = SPt_K, EPt_K
      DO j = SPt_J, EPt_J
        DO i = SPt_I, EPt_I
          Qp(i,j,k,nbl,npr) = Qp_packed(1 &
            + (i-SPt_I) &
            + (EPt_I-SPt_I+1)*(j-SPt_J) &
            + (EPt_I-SPt_I+1)*(EPt_J-SPt_J+1)*(k-SPt_K) &
            + (EPt_I-SPt_I+1)*(EPt_J-SPt_J+1)*(EPt_K-SPt_K+1)*(npr-1))
        ENDDO
      ENDDO
    ENDDO
  ENDDO
END SUBROUTINE UNPACK_DATA
As you can see, the communication routine (the first subroutine) is quite simple: the sending rank packs the interface data into a contiguous buffer and sends it, and the receiving rank receives and unpacks it.
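For reference, the communication share per iteration is measured with a simple wall-clock timer around each exchange, roughly as in the sketch below (t0 and t_comm are just illustrative double precision accumulators):

! minimal sketch of how the communication time per iteration is accumulated
! (t0 and t_comm are illustrative timer variables; MPI_WTIME is the MPI wall clock)
t0 = MPI_WTIME()
call MPI_INTERFACE_on_TOP(block, Home_Block, Neigh_Block, home_block_rank, neigh_block_rank, comm_tag)
t_comm = t_comm + (MPI_WTIME() - t0)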
I have used the same routines in a smaller toy program that transfers arrays of similar size, and there the communication speed is fine and I get perfect scaling. However, in my real program, which has many global variables and a lot of computation, something is slowing the communication down and I don't understand why. What are the possible reasons for this issue, and is there a way to fix it? Is there a better, more efficient way to transfer data from one GPU to another than the naive approach I have implemented?
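For completeness, the toy program mentioned above does little more than repeat the exchange in a loop, roughly along these lines (a simplified sketch; niters and the rank/tag choices here are illustrative):

! simplified sketch of the toy benchmark:
! rank 1 (neigh_block_rank) packs and sends, rank 0 (home_block_rank) receives and unpacks
do it = 1, niters
   call MPI_INTERFACE_on_TOP(block, Home_Block, Neigh_Block, 0, 1, it)
end do
call MPI_BARRIER(MPI_COMM_WORLD, ierr)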
Thanks