I’m a molecular dynamics researcher, and I’m writing a CUDA Fortran code to compute the sum of thousands of pair potentials. Because the number of pairs in my system is huge, I have to use multiple GPUs. The idea is to split the whole system into two parts. My compute node has two GPUs, a Tesla C2070 and a GT 440, and I’m trying to make them work together using OpenMP.
Now I get an error:
0: copyout Memcpy (host=0x7f49880016c0, dev=0x200300800, size=56) FAILED: 4(unspecified launch failure)
The main program is:
    !$omp parallel &
    !$omp private(iCPU, iStart, iEnd, uTot, nBlockSP, prop)
    !$omp master
    nCPUs = omp_get_num_threads()
    print*,'# of CPU(s):',nCPUs
    !$omp end master
    !$omp barrier
    iCPU = omp_get_thread_num()
    istat = cudaSetDevice(iCPU)
    call iniGPU()            !! copy some parameters to the device
    call uptAllHostToDev()   !! copy coordinates to the device
    !$omp barrier
    if(iCPU == 0)then
        iStart = 1
        iEnd = nPartA
        nBlockSP = nBlocksA
    else
        iStart = nPartA + 1
        iEnd = nPartsTot
        nBlockSP = nBlocksB
    end if
    call getFullEnSP(iStart, iEnd, nBlockSP, uTot)
    print*,iCPU, uTot
    istat=cudaThreadExit()
    !$omp end parallel
where nBlocksA = 7 (on device 0, the Tesla) and nBlocksB = 2 (on device 1, the GT 440). The subroutine ‘getFullEnSP’ is defined as:
    subroutine getFullEnSP(iStart, iEnd, nBlockSP, uTot)
        real*8 :: uTot
        integer :: nBlockSP,iStart,iEnd
        integer :: iPart
        integer, device :: d_iPart, d_iStart, d_iEnd
        real*8, allocatable, dimension(:), device :: d_block
        real*8, allocatable, dimension(:) :: blockEnergy
        integer :: ierr

        allocate(blockEnergy(nBlockSP),d_block(nBlockSP))
        uTot = 0.d0
        blockEnergy = 0.d0
        d_block = 0.d0

        !! GPU total energy
        do iPart=1, nPartsTot
            d_iPart = iPart
            d_iStart = iStart
            d_iEnd = iEnd
            call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_block)
            blockEnergy(:) = d_block(:)   !!!***ERROR IS HERE!***
            uTot = uTot + sum(blockEnergy)
        end do

        deallocate(blockEnergy,d_block)
        uTot = uTot / 2.d0
        print*,uTot
    end subroutine
According to the error information, it is easy to locate the error in the subroutine:
blockEnergy(:) = d_block(:)
(Here 7 * size(real*8) = 56 bytes, which matches the size in the error message.)
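Because kernel launches are asynchronous, a failure reported at this copy could also originate in the `getPartEnSP` launch just before it rather than in the copy itself. Below is a minimal sketch of a check that could be called right after the launch to separate the two cases. It assumes the `cudafor` module is available; `checkLaunch` is a hypothetical helper name and is not part of my code.

```fortran
! Hypothetical helper (not part of the original code): reports errors from the
! most recent kernel launch. Assumes the cudafor module of CUDA Fortran.
subroutine checkLaunch(label)
    use cudafor
    implicit none
    character(len=*), intent(in) :: label
    integer :: istat

    ! Errors detected at launch time (e.g. an invalid launch configuration)
    istat = cudaGetLastError()
    if (istat /= cudaSuccess) then
        print *, trim(label), ' launch error: ', trim(cudaGetErrorString(istat))
    end if

    ! Block until the kernel has finished; errors raised during execution
    ! (e.g. an out-of-bounds access) are returned here.
    istat = cudaThreadSynchronize()
    if (istat /= cudaSuccess) then
        print *, trim(label), ' execution error: ', trim(cudaGetErrorString(istat))
    end if
end subroutine checkLaunch
```

It would be used as `call checkLaunch('getPartEnSP')` immediately after the kernel launch in `getFullEnSP`, before the copy into `blockEnergy`. `cudaThreadSynchronize` is chosen only to match the `cudaThreadExit` call already in the code; newer toolkits name the equivalent call `cudaDeviceSynchronize`.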
The code works fine if I use only one device, but the error appears as soon as I try to use two.
Does anyone have any suggestions?
Thanks in advance