Problems on OpenMP and multi-GPU

Greetings,

I’m a researcher on molecular dynamics, and I’m trying to write a CUDA-Fortran code to get the sum of thousands of pair potentials. As the number of pairs is huge in my system, I have to use multiple GPUs. The idea is to separate the whole system into two parts. I have two GPUs in my computational node, TESLA C2070 and GT 440, and I try to make them work together by OpenMP.

Now I get an error:
0: copyout Memcpy (host=0x7f49880016c0, dev=0x200300800, size=56) FAILED: 4(unspecified launch failure)

The main program is:

!$omp parallel &
	!$omp private(iCPU, iStart, iEnd, uTot, nBlockSP, prop)
	!$omp master
	nCPUs = omp_get_num_threads() 
	print*,'# of CPU(s):',nCPUs
	!$omp end master 
	!$omp barrier 
	
	iCPU = omp_get_thread_num()
	istat = cudaSetDevice(iCPU)
	call iniGPU() !! copy some parameters to the device
	call uptAllHostToDev() !! copy coordinates to the device
	!$omp barrier 
	

	if(iCPU == 0)then
		iStart = 1
		iEnd = nPartA
		nBlockSP = nBlocksA
	else
		iStart = nPartA + 1
		iEnd = nPartsTot
		nBlockSP = nBlocksB
	end if
	
	call getFullEnSP(iStart, iEnd, nBlockSP, uTot)
	print*,iCPU, uTot
	
	istat=cudaThreadExit() 
!$omp end parallel

where nBlocksA=7(on device 0: TESLA) and nBlocksB=2(on device 1 GT 440). And the subroutine ‘getFullEnSP’ is defined as:

subroutine getFullEnSP(iStart, iEnd, nBlockSP, uTot)
	
	real*8 :: uTot
	integer :: nBlockSP,iStart,iEnd
	
	integer :: iPart
	integer, device :: d_iPart, d_iStart, d_iEnd
	real*8, allocatable, dimension(:), device :: d_block
	real*8, allocatable, dimension(:) :: blockEnergy
	integer :: ierr
	
	allocate(blockEnergy(nBlockSP),d_block(nBlockSP))
	uTot = 0.d0
	blockEnergy = 0.d0
	d_block = 0.d0
	!! GPU total energy
	do iPart=1, nPartsTot
		d_iPart = iPart
		d_iStart = iStart
		d_iEnd = iEnd
		call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_block)
		blockEnergy(:) = d_block(:) !!!***ERROR IS HERE!***
		uTot = uTot + sum(blockEnergy)
	end do
	deallocate(blockEnergy,d_block)
	
	uTot = uTot / 2.d0
	print*,uTot
end subroutine

According to the error information, it is easy to locate the error in the subroutine:

blockEnergy(:) = d_block(:)
(Here 7size(real8) is 56.)

It’s OK if I use only one device, but once I try to use 2, here comes the error.

Does anyone has suggestions?

Thanks in advance

Hi Weixiao,

call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_block)
blockEnergy(:) = d_block(:) !!!ERROR IS HERE!

The error is most likely coming from your kernel and not the copy. Add a call to “cudaGetLastError” just after your kernel launch to see if you can catch this error.

Since the code does work with a single OpenMP thread, I’m guessing that there is some shared global device data that’s being used by both threads. This of course causes problems since multiple GPU context can’t share data.

If you can post a complete example of your code or send the full version to PGI Customer service (trs@pgroup.com), I can help find the problem. Otherwise, look at your kernel to see if it’s accessing any device module variables that have not been made OpenMP private.

  • Mat

Hi, Mat!

Thanks a lot for your reply. I have sent the code, and please help to check it.

Best,
Weixiao.

Hi Weixiao,

I see a number of fundamental issues. First, you initialize the device and allocated data before entering the OpenMP region and setting the Cuda device. You need to wait to use the device before you set the context, otherwise a context is implicitly set.

Second, re-evaluate how you are doing your domain decomposition. It looks like of the you divide up the 4000 blocks with the first 3200 on the first GPU and 800 on the second. More importantly, you create an device arrays of the partitioned blocks but don’t adjust you indexing. Hence each openMP thread is is looping through 1 to nPartTot but each array is only a subset of nPartTot.

That’s all the time I have for today, and unfortunately the code still breaks. Hopefully you can work out the last problems.

  • Mat

Hi, Mat!

Thanks for your last reply. I have fixed the bug that to initialize the device too early.

I found a naive way to solve the last problem:

Copy the kernel function ‘getPartEnSP’ and rename it as ‘getPartEnSP2’, then use

do iPart=1, nPartsTot
		d_iPart = iPart
		if(omp_get_thread_num == 0) then
			call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_nBlockSP, d_block)
		else
			call getPartEnSP2<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_nBlockSP, d_block)
		end if
		blockEnergy(:) = d_block(:)
		uTot = uTot + sum(blockEnergy)

end do

to run then seperately.

Although it is not an intelligent way, it could work somehow. Do you have any idea about reusing the kernel instead of renaming it?

Best,
Weixiao.

Hi Weixiao,

Since I don’t know your algorithm it hard to give you good advice on how to reorganize your code. However, hard coding the number of OpenMP threads is a poor choice. Ideally you want to decompose your problem amongst an arbitrary number of threads.

Can the do iPart loop be divided up amongst the threads? If I read your code correctly, you have each threads execute every part but then divide up the domain within the kernel itself. If you can do the domain decomposition higher up, then your kernel can just work on it portion of the domain.

  • Mat