Problems on OpenMP and multi-GPU

Weixiao · August 13, 2012, 10:04pm

Greetings,

I’m a researcher on molecular dynamics, and I’m trying to write a CUDA-Fortran code to get the sum of thousands of pair potentials. As the number of pairs is huge in my system, I have to use multiple GPUs. The idea is to separate the whole system into two parts. I have two GPUs in my computational node, TESLA C2070 and GT 440, and I try to make them work together by OpenMP.

Now I get an error:
0: copyout Memcpy (host=0x7f49880016c0, dev=0x200300800, size=56) FAILED: 4(unspecified launch failure)

The main program is:

!$omp parallel &
	!$omp private(iCPU, iStart, iEnd, uTot, nBlockSP, prop)
	!$omp master
	nCPUs = omp_get_num_threads() 
	print*,'# of CPU(s):',nCPUs
	!$omp end master 
	!$omp barrier 
	
	iCPU = omp_get_thread_num()
	istat = cudaSetDevice(iCPU)
	call iniGPU() !! copy some parameters to the device
	call uptAllHostToDev() !! copy coordinates to the device
	!$omp barrier 
	

	if(iCPU == 0)then
		iStart = 1
		iEnd = nPartA
		nBlockSP = nBlocksA
	else
		iStart = nPartA + 1
		iEnd = nPartsTot
		nBlockSP = nBlocksB
	end if
	
	call getFullEnSP(iStart, iEnd, nBlockSP, uTot)
	print*,iCPU, uTot
	
	istat=cudaThreadExit() 
!$omp end parallel

where nBlocksA=7(on device 0: TESLA) and nBlocksB=2(on device 1 GT 440). And the subroutine ‘getFullEnSP’ is defined as:

subroutine getFullEnSP(iStart, iEnd, nBlockSP, uTot)
	
	real*8 :: uTot
	integer :: nBlockSP,iStart,iEnd
	
	integer :: iPart
	integer, device :: d_iPart, d_iStart, d_iEnd
	real*8, allocatable, dimension(:), device :: d_block
	real*8, allocatable, dimension(:) :: blockEnergy
	integer :: ierr
	
	allocate(blockEnergy(nBlockSP),d_block(nBlockSP))
	uTot = 0.d0
	blockEnergy = 0.d0
	d_block = 0.d0
	!! GPU total energy
	do iPart=1, nPartsTot
		d_iPart = iPart
		d_iStart = iStart
		d_iEnd = iEnd
		call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_block)
		blockEnergy(:) = d_block(:) !!!***ERROR IS HERE!***
		uTot = uTot + sum(blockEnergy)
	end do
	deallocate(blockEnergy,d_block)
	
	uTot = uTot / 2.d0
	print*,uTot
end subroutine

According to the error information, it is easy to locate the error in the subroutine:

blockEnergy(:) = d_block(:)
(Here 7size(real8) is 56.)

It’s OK if I use only one device, but once I try to use 2, here comes the error.

Does anyone has suggestions?

Thanks in advance

MatColgrove · August 14, 2012, 4:49pm

Hi Weixiao,

call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_block)
blockEnergy(:) = d_block(:) !!!ERROR IS HERE!

The error is most likely coming from your kernel and not the copy. Add a call to “cudaGetLastError” just after your kernel launch to see if you can catch this error.

Since the code does work with a single OpenMP thread, I’m guessing that there is some shared global device data that’s being used by both threads. This of course causes problems since multiple GPU context can’t share data.

If you can post a complete example of your code or send the full version to PGI Customer service (trs@pgroup.com), I can help find the problem. Otherwise, look at your kernel to see if it’s accessing any device module variables that have not been made OpenMP private.

Mat

Weixiao · August 14, 2012, 5:44pm

Hi, Mat!

Thanks a lot for your reply. I have sent the code, and please help to check it.

Best,
Weixiao.

MatColgrove · August 14, 2012, 9:39pm

Hi Weixiao,

I see a number of fundamental issues. First, you initialize the device and allocated data before entering the OpenMP region and setting the Cuda device. You need to wait to use the device before you set the context, otherwise a context is implicitly set.

Second, re-evaluate how you are doing your domain decomposition. It looks like of the you divide up the 4000 blocks with the first 3200 on the first GPU and 800 on the second. More importantly, you create an device arrays of the partitioned blocks but don’t adjust you indexing. Hence each openMP thread is is looping through 1 to nPartTot but each array is only a subset of nPartTot.

That’s all the time I have for today, and unfortunately the code still breaks. Hopefully you can work out the last problems.

Mat

Weixiao · August 15, 2012, 7:55pm

Hi, Mat!

Thanks for your last reply. I have fixed the bug that to initialize the device too early.

I found a naive way to solve the last problem:

Copy the kernel function ‘getPartEnSP’ and rename it as ‘getPartEnSP2’, then use

do iPart=1, nPartsTot
		d_iPart = iPart
		if(omp_get_thread_num == 0) then
			call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_nBlockSP, d_block)
		else
			call getPartEnSP2<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_nBlockSP, d_block)
		end if
		blockEnergy(:) = d_block(:)
		uTot = uTot + sum(blockEnergy)

end do

to run then seperately.

Although it is not an intelligent way, it could work somehow. Do you have any idea about reusing the kernel instead of renaming it?

Best,
Weixiao.

MatColgrove · August 15, 2012, 10:28pm

Hi Weixiao,

Since I don’t know your algorithm it hard to give you good advice on how to reorganize your code. However, hard coding the number of OpenMP threads is a poor choice. Ideally you want to decompose your problem amongst an arbitrary number of threads.

Can the do iPart loop be divided up amongst the threads? If I read your code correctly, you have each threads execute every part but then divide up the domain within the kernel itself. If you can do the domain decomposition higher up, then your kernel can just work on it portion of the domain.

Mat

Topic		Replies	Views
CUDA & openMP Problem with the SDK sample code CUDA Programming and Performance	11	14003	September 12, 2015
OpenMP + OpenACC problem Legacy PGI Compilers	9	5262	April 17, 2019
CUDA + OpenMP oddity - looks like a compiler bug. Legacy PGI Compilers	6	12159	April 12, 2010
OpenMP Multi-GPU, not getting speedup expected CUDA Programming and Performance	5	5843	July 15, 2011
CUDA_ERROR_ILLEGAL_ADDRESS with OpenMP "distribute parallel for" nvc, nvc++ and nvfortran	2	232	May 15, 2024
Using multiple GPUs Legacy PGI Compilers	7	22079	August 11, 2009
Is it possible to use both OpenMP + CUDA in PGI fortran ? Legacy PGI Compilers	4	7698	December 18, 2010
OpenMP + different GTX GPUs + Driver > v391.35 (Win 10 / Win 7) CUDA Programming and Performance	15	1545	August 31, 2018
Got out of memory from cudaMemcpy CUDA Programming and Performance	13	3948	January 28, 2022
A little help with Multi-GPU example please :) How do I pass data to each GPU? CUDA Programming and Performance	8	28003	March 4, 2012

Problems on OpenMP and multi-GPU

Related topics