I'm running a CUDA program and using MPI to launch multiple processes of the same program on different CPU cores, all of which access the same GPU. Each process does the following (a rough code sketch follows the list):
- Initialize MPI and CUDA
- Get process rank and total number of processes
- Perform cudaMalloc
- For N iterations:
  - cudaMemcpy (host to device)
  - Execute a kernel on a portion of the data set
  - cudaMemcpy (device to host)
- Perform CPU compute
- Verify results
- Perform barrier sync
- Use MPI I/O calls to write timestamps of various events from a string buffer to a file
- Free memory and clean up
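In code, each process looks roughly like this. This is only a minimal sketch: the kernel body, launch configuration, iteration count, input values, and log file name are placeholders I've filled in for illustration, not my actual code.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define CHUNK_BYTES 2048                    /* each copy is at most 2 KB */
#define N_ELEMS     (CHUNK_BYTES / sizeof(float))
#define N_ITER      100                     /* placeholder iteration count */

/* Placeholder kernel standing in for the real computation. */
__global__ void process_chunk(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    float h_buf[N_ELEMS];
    for (int j = 0; j < (int)N_ELEMS; ++j)
        h_buf[j] = (float)rank;             /* stand-in input data */

    float *d_buf;
    cudaMalloc((void **)&d_buf, CHUNK_BYTES);

    int threads = 128;
    int blocks  = ((int)N_ELEMS + threads - 1) / threads;

    for (int i = 0; i < N_ITER; ++i) {
        cudaMemcpy(d_buf, h_buf, CHUNK_BYTES, cudaMemcpyHostToDevice);
        process_chunk<<<blocks, threads>>>(d_buf, (int)N_ELEMS);
        cudaMemcpy(h_buf, d_buf, CHUNK_BYTES, cudaMemcpyDeviceToHost);
    }

    /* ... CPU compute and result verification happen here ... */

    MPI_Barrier(MPI_COMM_WORLD);

    /* Each rank writes its timestamp string to a shared log via MPI I/O. */
    char log[256];
    int  len = snprintf(log, sizeof(log), "rank %d: <event timestamps>\n", rank);
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "timestamps.log",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_ordered(fh, log, len, MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}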
I execute this program with varying inputs and iteration counts using a bash script that invokes mpirun some 200 or 500 times. The problems I face are:
- Occasional, non-reproducible errors in the output, occurring at seemingly random iterations. Sometimes the output is garbage, sometimes all zeros.
- Sometimes a process freezes before crashing.
- On even rarer occasions, the system crashes completely, forcing me to do a cold reboot. While running the program in text mode, I got the following message just before the crash:
NVRM: Xid (0060:00) 13, 0004 00000000 000050c0 00000368 00000000 00000080
Each execution works on a small amount of data (a single cudaMemcpy in the program copies no more than 2 KB of data to the GPU).
I'm using an NVIDIA Quadro FX 5600 in a dual quad-core Xeon workstation running RHEL 4, with the CUDA 1.1 SDK and Toolkit and the default Open MPI package (1.1.1).
The NVIDIA driver I'm using is v169.04.
Does anyone have any idea what's happening?