CUDA+MPI error on workstation

Hi,
I'm using the 64-bit PGI Accelerator Fortran compiler, version 11.7, and I've run into a problem doing CUDA+MPI work on a 64-bit workstation.
It's a seismic migration code: it loops over many shots, and within every shot thousands of timesteps have to be calculated. That's the background of the code.
At first I checked the program with only 10, or a few hundred, timesteps, and it ran fine. But when I give it a realistic timestep count of about 6000, which makes the calculation time long, this error happens:
killed by signal 2
p0_31083: p4_error: net_recv read: probable EOF on socket: 1
p0_31083: (33208.406250) net_send: could not write to fd=4, errno = 32
*=============
The command line is:
pgfortran -Mcuda -Mmpi -o mpi mpi.f90
mpirun -np 3 mpi >a.dat&
*==============
The code is:

program RTM
  use cudafor
  include 'mpif.h'
  ! ... parameter declarations ...
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  ! ... read some input files ...
  ierr = cudaGetDeviceCount(numdev)
  ierr = cudaSetDevice(myid)   ! one device per rank; assumes numprocs <= numdev
  call cal(parameters)
  call MPI_FINALIZE(ierr)
end program RTM

subroutine cal(parameters)
  use cudafor
  include 'mpif.h'
  ! ... derived-parameter setup ...
  do ishots = 1+myid, nshots, numprocs   ! shot loop, round-robin across ranks
    do it = 1, max_timesteps
      ! call GPU kernel subroutines
      ! host_array = device_array        ! copy results back to the host
    enddo
    ! write the result for this shot to disk
  enddo
end subroutine cal
*===========================

Thanks in advance if someone can help solve this problem.

Hi zsh,

Is there another message above the signal 2? That signal just indicates that one of the MPI processes encountered some problem and was terminated with an interrupt. You'll need to do more digging to figure out what the actual error is; that should help narrow down the cause. From the information given it could be anything.
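
For example, one quick way to surface a hidden CUDA failure is to check every API return code rather than discarding ierr; a minimal CUDA Fortran sketch (the abort message is illustrative; cudaSuccess and cudaGetErrorString come from the cudafor module):

  istat = cudaSetDevice(myid)
  if (istat /= cudaSuccess) then
    ! report which rank failed, then bring the whole job down cleanly
    print *, 'rank ', myid, ': cudaSetDevice failed: ', cudaGetErrorString(istat)
    call MPI_ABORT(MPI_COMM_WORLD, istat, ierr)
  endif

The same pattern applies after kernel launches, using cudaGetLastError() to retrieve the status.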

  • Mat

Hi Mat,
Over the past few days I tried to find the cause of the problem in several ways, and determined it was memory. After I expanded the memory from 32 GB to 64 GB, my code ran to completion, but that doesn't fundamentally resolve the problem: when I increase the calculation scale or add compute nodes, it still happens.
So I paid attention to memory use during the calculation, and I found that the used memory grows larger and larger as time elapses, even though the code doesn't allocate or use that much memory.
I think it may be a memory leak, but I'm confused because I deallocated all of the allocated memory.
So, can you give me some idea of how to check which part of the code caused the memory leak?
Thanks!

Does the problem still occur if you use 1 process? For memory issues like these, I typically use Valgrind (www.valgrind.org), but it only has limited support for multi-process MPI.
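
If you want to try it, Valgrind can be launched under mpirun so that each rank gets its own instance; a hedged example reusing your command line above (--leak-check=full is a standard Valgrind option):

  mpirun -np 3 valgrind --leak-check=full ./mpi > a.dat

Note that Valgrind only sees host allocations. To watch device memory, one option is to log the free GPU memory once per shot with cudaMemGetInfo; a minimal CUDA Fortran sketch (the subroutine name and print format are illustrative):

  subroutine log_gpu_mem(myid, ishots)
    use cudafor
    implicit none
    integer, intent(in) :: myid, ishots
    integer :: istat
    integer(kind=cuda_count_kind) :: free, total
    ! query the current device's free and total memory, in bytes
    istat = cudaMemGetInfo(free, total)
    if (istat /= cudaSuccess) then
      print *, 'rank ', myid, ': ', cudaGetErrorString(istat)
    else
      print *, 'rank ', myid, ' shot ', ishots, ' free GPU bytes: ', free, ' of ', total
    end if
  end subroutine log_gpu_mem

If the free value shrinks on every shot, something device-side is not being released.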

  • Mat

1 process is OK without any problem; the error happens only when I use multiple processes. So I believe this analysis tool will be helpful.
Thanks for your reply!