CUDA-aware MPI on GPUs running display for Pascal and up

Hi,

I have come across a problem that I think may be an issue with either the compiler or OpenMPI.

I have a multi-GPU MPI+OpenACC code that uses CUDA-aware MPI through the host_data clause.
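For context, the communication pattern looks roughly like this (a simplified sketch rather than the actual POT3D source; the routine name, array name, and neighbor ranks are made up):

! Simplified sketch of the CUDA-aware MPI pattern via host_data.
! Assumes "a" is already present on the device (e.g. inside an
! enclosing !$acc data region).
subroutine exchange_seam(a, n, left, right, comm)
  use mpi
  implicit none
  integer, intent(in)    :: n, left, right, comm
  real(8), intent(inout) :: a(n)
  integer :: ierr
  ! host_data exposes the device address of "a" to the MPI call,
  ! so the library transfers GPU memory directly (CUDA-aware MPI).
  !$acc host_data use_device(a)
  call MPI_Sendrecv(a(n-1), 1, MPI_REAL8, right, 0, &
                    a(1),   1, MPI_REAL8, left,  0, &
                    comm, MPI_STATUS_IGNORE, ierr)
  !$acc end host_data
end subroutine exchange_seam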

For testing purposes, I have in the past been able to run the code using both GPUs on a machine that had a GTX 750TI and a TitanXP. In that case the 750TI was also being used to run the graphics windowing system (MATE).

My system now has a GTX 1050TI and an RTX 2080TI, with the 1050TI running the graphics. The code now crashes when trying to use both GPUs (or just the 1050TI).

On another machine, I have a single RTX 2070 that runs the graphics. Running the code on that single GPU also crashes in the same manner. If I disable my windowing system (server mode), the code runs fine. (Note that CUDA-aware MPI is still being used even with 1 GPU due to a periodic domain seam.)

The only common denominator I can see is that using CUDA-aware MPI on a GPU that is also running graphics does not seem to work when the GPU is Pascal or newer (since it DID work with the 750TI).

The crashes happen shortly into the run, but not right away; it takes a random number of steps before the crash occurs.

On the system with the single RTX 2070 running graphics, all the CUDA 10.1 sample programs ran fine, including the multi-GPU tests. This leads me to think it is an OpenMPI or PGI issue.

All systems were running Linux Mint 19.

The crash spits out:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

I know it is not necessarily common to run computation and graphics at the same time but it is useful for testing.

I think the POT3D code I previously sent you could reproduce this problem by switching CUDA-aware MPI on and off.

Hi Ron,

So this one is going to take a bit of digging and time.

Officially, PGI only supports the Tesla line of NVIDIA GPUs. Often, the GeForce line will also work since it shares the same architecture and CUDA drivers as the Tesla line, but we don’t test every GeForce card.

We do test on GTX1070 and GTX1080 cards (no RTXs, though), but none are used with display mode enabled. They are just used for compute. I’ve asked my IT folks to set something up for me that I can use to test. It might be a few days.

I’ll do my best to recreate the error and if possible determine the cause.

-Mat

Hi Ron,

My IT folks were able to set up a GTX1080 system for me with the display enabled. However, for good or bad, I’m not able to reproduce the error. I’ve tried PGI 18.10 with OpenMPI 2.1.2 and PGI 19.1 with OpenMPI 3.1.3, and both run correctly with CUDA-aware MPI. I also ran with 4 ranks, all using the same GPU, but the code ran fine.

My system has CUDA 10 installed.

Unfortunately, you’re going to have to dig into this more and determine what the issue is, or at least which component is causing the problem.

-Mat

Hi,

Thanks for looking into this.

The code I am running uses CUDA-aware MPI on a derived type member array (such as a%x), although “a%x” is passed to the subroutine instead of “a” due to the (still not fixed …) bug in PGI.
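For illustration, the workaround looks roughly like this (a simplified sketch with made-up names, not the actual code):

! Hypothetical sketch: the member array a%x is passed to the MPI
! routine rather than the parent derived type "a".
module field_mod
  implicit none
  type :: field_t
    real(8), allocatable :: x(:)
  end type field_t
contains
  subroutine seam_exchange(x, n, comm)
    use mpi
    integer, intent(in)    :: n, comm
    real(8), intent(inout) :: x(n)   ! assumed present on the device
    integer :: ierr
    !$acc host_data use_device(x)
    ! Periodic seam: with a single rank this sends to / receives from
    ! itself (rank 0), so CUDA-aware MPI is exercised even on 1 GPU.
    call MPI_Sendrecv_replace(x, n, MPI_REAL8, 0, 0, 0, 0, comm, &
                              MPI_STATUS_IGNORE, ierr)
    !$acc end host_data
  end subroutine seam_exchange
end module field_mod

! The caller passes the member array, not the derived type:
!   type(field_t) :: a
!   call seam_exchange(a%x, size(a%x), MPI_COMM_WORLD)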

Could this have something to do with it?

P.S. Any updates on that bug? I have not tried testing it on 19.1 yet because the changelog doesn’t seem to mention any fix for it.

-Ron

“Could this have something to do with it?”

Not sure, but I did run with CUDA Aware MPI enabled.

TPR#26191, which you reported in “Code crashing with 18.7 - worked with 18.4”, should be fixed in PGI 19.1. Though that one had to do with us changing the default to -Mallocatable=03.

TPR#25243, which is what I think you’re referring to, still isn’t fixed. I added another plea in the comments. I’ll try to push harder, but there’s only so much I can do. Definitely pester Michael at GTC since the bug is currently assigned to him.

-Mat