CUDA+MPI = Unexplained Issues... Random Crashes, Erroneous Output?!?

I’m running a CUDA program and using MPI to launch multiple processes of it on the different CPU cores, all of which access the same GPU. Each process does the following:

  1. Initialize MPI, CUDA
  2. Get process rank and total no. of processes
  3. Perform cudaMalloc
  4. For N times do
    4.1 cudaMemcpy (Host to Device)
    4.2 Execute a Kernel on a portion of the data set
    4.3 cudaMemcpy (Device to Host)
    4.4 End Loop (No.4)
  5. Perform CPU Compute
  6. Verify Results
  7. Perform Barrier Sync
  8. Use MPI I/O calls to write timestamps of various events from a string buffer to file
  9. Free memory and Cleanup
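
Steps 2 and 4.2 imply that each process picks its portion of the data set from its rank, but the post does not show how. Here is a minimal sketch in plain C, assuming a contiguous block split. The helper `block_range` and its parameters are hypothetical; in the real program `rank` and `size` would come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: the slice of the data set owned by one rank. */
typedef struct {
    size_t begin; /* index of the first element owned by this rank */
    size_t count; /* number of elements owned by this rank */
} block_range_t;

/* ASSUMPTION: contiguous block split of n elements over `size` processes.
 * The first (n % size) ranks get one extra element, so the blocks cover
 * [0, n) exactly once with no overlap. */
static block_range_t block_range(size_t n, int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    size_t base  = n / (size_t)size;
    size_t extra = n % (size_t)size;
    size_t r     = (size_t)rank;

    block_range_t out;
    out.count = base + (r < extra ? 1 : 0);
    out.begin = r * base + (r < extra ? r : extra);
    return out;
}
```

With 10 elements and 4 processes, this gives ranks 0 and 1 three elements each and ranks 2 and 3 two elements each. Checking that the per-rank ranges neither overlap nor run past the end of the buffer is a cheap way to rule out indexing mistakes before suspecting the driver.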

I run this program on varying inputs and iteration counts, using a bash script that invokes mpirun some 200 or 500 times. The problems I face are:

  1. Occasional, non-reproducible errors in the output, occurring at seemingly random iterations. Sometimes the output is garbage, sometimes all zeros.

  2. Sometimes the process freezes before crashing.

  3. On even rarer occasions, the whole system crashes, forcing me to do a cold reboot. When I ran the program in text mode, I got the following message just before it crashed:

NVRM: Xid (0060:00) 13, 0004 00000000 000050c0 00000368 00000000 00000080

Each program execution works on a small amount of data (a single cudaMemcpy in the program copies no more than 2 KB to the GPU).

I’m using an NVIDIA Quadro FX 5600 on a dual Xeon quad-core workstation, running RHEL4, the CUDA 1.1 SDK and Toolkit, and the default Open MPI package (1.1.1).

The NVIDIA driver I’m using is v169.04.

Does anyone have any idea what’s happening?

What kind of errors do you get?

This, in general, does not work very well. When it does work, you will only get 1-10% of the capability of the GPU. Other times, the driver may even bring the whole machine down.

Is there any particular reason you want to have all CPU cores using the GPU in different contexts simultaneously?

Sorry, people. I hadn’t finished the post: I left it midway to double-check the program and didn’t realize that the incomplete post had been submitted. Please take another look at the start of the thread, as I’ve completed the post now.

Sorry again!

Can anyone throw some more light on this topic? Any help would be much appreciated.

Upon rereading, I realize what I was saying in the first post isn’t very clear. :)

Currently, having the GPU accessed by many CUDA jobs simultaneously is not well supported. The driver will often do its best to accommodate all of the requests, but unfortunately it is not designed for that. With all of the requests arriving at once, the driver is pulled in many different directions and tries to timeslice between the CUDA jobs. The result is poor performance overall, especially per process, and even general system instability.

CUDA works best when you have a single thread accessing the GPU. If you are attempting to develop or debug a distributed-memory CUDA program, it might make sense to write a wrapper which, instead of calling the GPU directly, transfers the data to another process that stands by outside of the MPI job and handles the GPU work.
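
The shape of such a wrapper can be sketched in plain C. This is only an illustration under stated assumptions: it uses POSIX fork and pipes as the transport, and a CPU loop (`device_work`) stands in for the real copy, kernel and copy-back sequence, so it contains no CUDA or MPI calls. All names are hypothetical. It shows one client only; several MPI ranks would need a channel that every rank can reach (a Unix domain socket, for example), which is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_ELEMS 512u /* 512 floats = 2 KB, the per-copy bound from the post */

/* Hypothetical client-side handle for the process that owns the GPU. */
typedef struct {
    pid_t pid;    /* server process id */
    int to_srv;   /* client writes requests here */
    int from_srv; /* client reads results here */
} gpu_server_t;

/* ASSUMPTION: CPU stand-in for copy-in, kernel launch and copy-out. */
static void device_work(float *buf, uint32_t n)
{
    assert(n <= MAX_ELEMS);
    for (uint32_t i = 0; i < n; ++i)
        buf[i] *= 2.0f;
}

/* Transfer exactly len bytes (write if do_write, else read).
 * Returns 0 on success, -1 on error or end of file. */
static int xfer_full(int fd, void *p, size_t len, int do_write)
{
    char *c = p;
    while (len > 0) {
        ssize_t k = do_write ? write(fd, c, len) : read(fd, c, len);
        if (k <= 0) return -1;
        c += k;
        len -= (size_t)k;
    }
    return 0;
}

/* Server loop: requests are handled strictly one at a time, so this is
 * the only place that would ever touch the GPU. A count of 0 means stop. */
static void serve(int in, int out)
{
    float buf[MAX_ELEMS];
    uint32_t n;
    while (xfer_full(in, &n, sizeof n, 0) == 0 && n > 0 && n <= MAX_ELEMS) {
        if (xfer_full(in, buf, n * sizeof(float), 0) != 0) break;
        device_work(buf, n);
        if (xfer_full(out, buf, n * sizeof(float), 1) != 0) break;
    }
}

/* Fork the server and wire up one request pipe and one reply pipe. */
static int gpu_server_start(gpu_server_t *s)
{
    int req[2], rsp[2];
    if (pipe(req) != 0) return -1;
    if (pipe(rsp) != 0) {
        close(req[0]); close(req[1]);
        return -1;
    }
    s->pid = fork();
    if (s->pid < 0) {
        close(req[0]); close(req[1]); close(rsp[0]); close(rsp[1]);
        return -1;
    }
    if (s->pid == 0) {            /* child: becomes the GPU owner */
        close(req[1]); close(rsp[0]);
        serve(req[0], rsp[1]);
        _exit(0);
    }
    close(req[0]); close(rsp[1]); /* parent: keeps the client ends */
    s->to_srv = req[1];
    s->from_srv = rsp[0];
    return 0;
}

/* Stand-in for steps 4.1 to 4.3: data is overwritten with the result. */
static int gpu_server_call(gpu_server_t *s, float *data, uint32_t n)
{
    if (n == 0 || n > MAX_ELEMS) return -1;
    if (xfer_full(s->to_srv, &n, sizeof n, 1) != 0 ||
        xfer_full(s->to_srv, data, n * sizeof(float), 1) != 0)
        return -1;
    return xfer_full(s->from_srv, data, n * sizeof(float), 0);
}

/* Ask the server to exit, then reap it. */
static void gpu_server_stop(gpu_server_t *s)
{
    uint32_t zero = 0;
    (void)xfer_full(s->to_srv, &zero, sizeof zero, 1);
    close(s->to_srv);
    close(s->from_srv);
    (void)waitpid(s->pid, NULL, 0);
}
```

A client would call `gpu_server_start` once, call `gpu_server_call(&s, data, n)` wherever it previously did its own copy, kernel and copy-back, and call `gpu_server_stop` during cleanup. Because the server handles one request at a time, only one context would ever be active on the GPU.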

I learned this through experience about a year ago, so please ask more specific questions if you have any.