CUDA+MPI error on workstation

Hi,
I'm using the 64-bit PGI Accelerator Fortran compiler, version 11.7, and I've run into a problem doing CUDA+MPI work on a 64-bit workstation.
It's a seismic migration code: it loops over many shots, and within every shot thousands of timesteps have to be calculated. That's the background of the code.
At first I checked the program with only 10, or a few hundred, timesteps, and it ran fine. But when I give it a realistic timestep count of about 6000, which makes the calculation time long, this error happens:
killed by signal 2
p0_31083: p4_error: net_recv read: probable EOF on socket: 1
p0_31083: (33208.406250) net_send: could not write to fd=4, errno = 32
*=============
The command line is:
pgfortran -Mcuda -Mmpi -o mpi mpi.f90
mpirun -np 3 mpi >a.dat&
*==============
The code is:

program RTM
  use cudafor
  include 'mpif.h'
  ! ... parameter declarations ...
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  ! ... read some input files ...
  ierr = cudaGetDeviceCount(numdev)
  ierr = cudaSetDevice(myid)   ! one device per rank; assumes numprocs <= numdev
  call cal(parameters)
  call MPI_FINALIZE(ierr)
end program RTM

subroutine cal(parameters)
  use cudafor
  include 'mpif.h'
  ! ... derived-parameter setup ...
  do ishots = 1+myid, nshots, numprocs   ! shot loop, round-robin across ranks
    do it = 1, max_timesteps
      ! call GPU kernel subroutines
      ! host_array = device_array        ! copy results back to the host
    enddo
    ! write the result for this shot to disk
  enddo
end subroutine cal
*===========================

Thanks in advance if someone can help solve this problem.

Hi zsh,

Is there another message above the signal 2? That signal just indicates that one of the MPI processes encountered some problem and was terminated with an interrupt. You'll need to do more digging to figure out what the actual error is; that should help narrow down the cause. From the information given it could be anything.
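
For example, one quick way to surface a hidden CUDA failure is to check every API return code rather than discarding ierr; a minimal CUDA Fortran sketch (the abort message is illustrative; cudaSuccess and cudaGetErrorString come from the cudafor module):

  istat = cudaSetDevice(myid)
  if (istat /= cudaSuccess) then
    ! report which rank failed, then bring the whole job down cleanly
    print *, 'rank ', myid, ': cudaSetDevice failed: ', cudaGetErrorString(istat)
    call MPI_ABORT(MPI_COMM_WORLD, istat, ierr)
  endif

The same pattern applies after kernel launches, using cudaGetLastError() to retrieve the status.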

  • Mat

Hi Mat,
Over the past few days I tried to find the cause of the problem in several ways, and determined it was memory. After I expanded the memory from 32 GB to 64 GB, my code ran to completion, but that doesn't fundamentally resolve the problem: when I increase the calculation scale or add compute nodes, it still happens.
So I paid attention to memory use during the calculation, and I found that the used memory grows larger and larger as time elapses, even though the code doesn't allocate or use that much memory.
I think it may be a memory leak, but I'm confused because I deallocated all of the allocated memory.
So, can you give me some idea of how to check which part of the code caused the memory leak?
Thanks!

Does the problem still occur if you use 1 process? For memory issues like these, I typically use Valgrind (www.valgrind.org), but it only has limited support for multi-process MPI.
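
If you want to try it, Valgrind can be launched under mpirun so that each rank gets its own instance; a hedged example reusing your command line above (--leak-check=full is a standard Valgrind option):

  mpirun -np 3 valgrind --leak-check=full ./mpi > a.dat

Note that Valgrind only sees host allocations. To watch device memory, one option is to log the free GPU memory once per shot with cudaMemGetInfo; a minimal CUDA Fortran sketch (the subroutine name and print format are illustrative):

  subroutine log_gpu_mem(myid, ishots)
    use cudafor
    implicit none
    integer, intent(in) :: myid, ishots
    integer :: istat
    integer(kind=cuda_count_kind) :: free, total
    ! query the current device's free and total memory, in bytes
    istat = cudaMemGetInfo(free, total)
    if (istat /= cudaSuccess) then
      print *, 'rank ', myid, ': ', cudaGetErrorString(istat)
    else
      print *, 'rank ', myid, ' shot ', ishots, ' free GPU bytes: ', free, ' of ', total
    end if
  end subroutine log_gpu_mem

If the free value shrinks on every shot, something device-side is not being released.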

  • Mat

1 process is OK without any problem; the error happens only when I use multiple processes. So I believe this analysis tool will be helpful.
Thanks for your reply!