Greetings,
I’m a molecular dynamics researcher, and I’m trying to write a CUDA Fortran code that sums thousands of pair potentials. Because the number of pairs in my system is huge, I need to use multiple GPUs. The idea is to split the system into two parts, one per GPU. My compute node has two GPUs, a Tesla C2070 and a GT 440, and I’m trying to drive them together with OpenMP, one host thread per device.
Now I get an error:
0: copyout Memcpy (host=0x7f49880016c0, dev=0x200300800, size=56) FAILED: 4(unspecified launch failure)
The main program is:
!$omp parallel &
!$omp private(iCPU, iStart, iEnd, uTot, nBlockSP, prop)
!$omp master
    nCPUs = omp_get_num_threads()
    print*,'# of CPU(s):',nCPUs
!$omp end master
!$omp barrier
    iCPU = omp_get_thread_num()
    istat = cudaSetDevice(iCPU)
    call iniGPU() !! copy some parameters to the device
    call uptAllHostToDev() !! copy coordinates to the device
!$omp barrier
    if(iCPU == 0)then
        iStart = 1
        iEnd = nPartA
        nBlockSP = nBlocksA
    else
        iStart = nPartA + 1
        iEnd = nPartsTot
        nBlockSP = nBlocksB
    end if
    call getFullEnSP(iStart, iEnd, nBlockSP, uTot)
    print*,iCPU, uTot
    istat=cudaThreadExit()
!$omp end parallel
where nBlocksA = 7 (used on device 0, the Tesla) and nBlocksB = 2 (used on device 1, the GT 440). The subroutine ‘getFullEnSP’ is defined as:
subroutine getFullEnSP(iStart, iEnd, nBlockSP, uTot)
    real*8 :: uTot
    integer :: nBlockSP,iStart,iEnd
    integer :: iPart
    integer, device :: d_iPart, d_iStart, d_iEnd
    real*8, allocatable, dimension(:), device :: d_block
    real*8, allocatable, dimension(:) :: blockEnergy
    integer :: ierr
    allocate(blockEnergy(nBlockSP),d_block(nBlockSP))
    uTot = 0.d0
    blockEnergy = 0.d0
    d_block = 0.d0
    !! GPU total energy
    do iPart=1, nPartsTot
        d_iPart = iPart
        d_iStart = iStart
        d_iEnd = iEnd
        call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_block)
        blockEnergy(:) = d_block(:) !!!***ERROR IS HERE!***
        uTot = uTot + sum(blockEnergy)
    end do
    deallocate(blockEnergy,d_block)
    uTot = uTot / 2.d0
    print*,uTot
end subroutine
According to the error message, the failure is reported at this line of the subroutine:
blockEnergy(:) = d_block(:)
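(Here 7 * sizeof(real*8) = 7 * 8 = 56 bytes, which matches size=56 in the message, so the failing copy is the one on device 0 with nBlocksA = 7.)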
The code runs fine when I use only one device; the error appears only when I try to use both.
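Since kernel launches are asynchronous, an "unspecified launch failure" reported at a copy may actually come from the kernel launched just before it. Below is a minimal sketch of how the launch could be checked separately from the copy. The kernel `fillBlocks` and the helper `checkCuda` are hypothetical stand-ins written for this illustration and are not part of my code; the sketch assumes a single device and the `cudafor` module.

```fortran
module diag_mod
  use cudafor
  implicit none
contains
  ! Hypothetical stand-in kernel: thread 1 of each block writes one value.
  attributes(global) subroutine fillBlocks(d_block, n)
    integer, value :: n
    real(8) :: d_block(n)
    if (threadIdx%x == 1 .and. blockIdx%x <= n) d_block(blockIdx%x) = 1.0d0
  end subroutine fillBlocks

  ! Hypothetical helper: report launch and execution errors with a label.
  subroutine checkCuda(label)
    character(len=*), intent(in) :: label
    integer :: ierr
    ierr = cudaGetLastError()          ! errors detected at launch time
    if (ierr == cudaSuccess) then
      ierr = cudaDeviceSynchronize()   ! errors raised while the kernel ran
    end if
    if (ierr /= cudaSuccess) then
      print *, trim(label), ': ', trim(cudaGetErrorString(ierr))
      stop 1
    end if
  end subroutine checkCuda
end module diag_mod

program diag
  use cudafor
  use diag_mod
  implicit none
  integer, parameter :: nBlocks = 7, nThreads = 64
  real(8), device, allocatable :: d_block(:)
  real(8) :: blockEnergy(nBlocks)
  integer :: istat

  istat = cudaSetDevice(0)
  allocate(d_block(nBlocks))
  d_block = 0.0d0

  call fillBlocks<<<nBlocks, nThreads>>>(d_block, nBlocks)
  call checkCuda('after fillBlocks')   ! a kernel fault would be reported here

  blockEnergy = d_block                ! the copy-out only runs if the kernel succeeded
  print *, 'sum of block values:', sum(blockEnergy)
  deallocate(d_block)
end program diag
```

If a check like this fails right after the kernel call, the problem would be inside the kernel (for example, an out-of-bounds access) rather than in the copy-out itself.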
Does anyone have any suggestions?
Thanks in advance