We have a multi-GPU system with two hexa-core CPUs and two Kepler K20 cards. Both cards are connected to CPU 1 over the same PCI bus.
The code crashes with the following error when the two GPUs perform a GPUDirect send and receive over the PCI bus:
“The call to cuEventRecord failed. This is a unrecoverable error and will cause the program to abort.
cuEventRecord return value: 709 Check the cuda.h file for what the return value means”.
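Following the hint in the message, here is a minimal sketch for decoding the return value. The table is only a small subset of the CUresult enum as I read it in cuda.h, and the helper name describe_cu_result is hypothetical; please verify the values against the header shipped with your own toolkit.

```python
# Subset of CUresult values as read from cuda.h (an assumption to verify
# against the header of the CUDA toolkit actually installed).
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    704: "CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED",
    705: "CUDA_ERROR_PEER_ACCESS_NOT_ENABLED",
    708: "CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE",
    709: "CUDA_ERROR_CONTEXT_IS_DESTROYED",
}


def describe_cu_result(code):
    """Return the symbolic name for a driver API return code, if known."""
    return CU_RESULT_NAMES.get(code, "unknown CUresult ({})".format(code))


if __name__ == "__main__":
    print(describe_cu_result(709))
```

If that reading is right, 709 would mean the CUDA context the event belongs to was already destroyed when cuEventRecord was called, though I cannot say from this alone why that would happen only on some runs.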
I am fairly confident that the implementation is correct, because it runs fine on some runs and crashes only on others.
Initially we did not have CPU 2 and everything worked fine. I have only been seeing this error since we installed the second CPU. There is not much information on the internet about this error.
I am also restricting both MPI ranks to CPU 1 by specifying slots=1 in the hostfile.
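As far as I understand, slots normally controls how many processes are placed on a host and not which cores they run on, so it may be worth checking the pinning from inside each rank. Below is a minimal sketch, assuming a Linux node and an Open MPI launcher (the OMPI_COMM_WORLD_RANK variable is Open MPI specific; the function name allowed_cpus is hypothetical).

```python
import os


def allowed_cpus():
    """Return the sorted logical CPU ids this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return sorted(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support: assume all CPUs.
    return list(range(os.cpu_count() or 1))


if __name__ == "__main__":
    # Assumption: launched by Open MPI, which exports OMPI_COMM_WORLD_RANK.
    # When run standalone the rank is shown as "?".
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    print("rank {} pid {} allowed CPUs: {}".format(rank, os.getpid(), allowed_cpus()))
```

Launching this script with the same mpirun command and hostfile as the real job should print one line per rank. If the list covers the cores of both sockets, the ranks are not actually restricted to CPU 1; the mapping from logical CPU ids to sockets can be read from lscpu.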
Any light on this issue would be highly appreciated.