CUDA-aware MPI on GPUs running display for Pascal and up

Hi,

I have come across a problem that I think may be an issue with either the compiler or OpenMPI.

I have a multi-GPU MPI+OpenACC code that uses CUDA-aware MPI through the host_data clause.
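For context, the communication pattern looks roughly like this (a simplified sketch rather than the actual POT3D source; the routine name, array name, and neighbor ranks are made up):

! Simplified sketch of the CUDA-aware MPI pattern via host_data.
! Assumes "a" is already present on the device (e.g. inside an
! enclosing !$acc data region).
subroutine exchange_seam(a, n, left, right, comm)
  use mpi
  implicit none
  integer, intent(in)    :: n, left, right, comm
  real(8), intent(inout) :: a(n)
  integer :: ierr
  ! host_data exposes the device address of "a" to the MPI call,
  ! so the library transfers GPU memory directly (CUDA-aware MPI).
  !$acc host_data use_device(a)
  call MPI_Sendrecv(a(n-1), 1, MPI_REAL8, right, 0, &
                    a(1),   1, MPI_REAL8, left,  0, &
                    comm, MPI_STATUS_IGNORE, ierr)
  !$acc end host_data
end subroutine exchange_seam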

For testing purposes, I have in the past been able to run the code using both GPUs on a machine that had a GTX 750TI and a TitanXP. In that case the 750TI was also being used to run the graphics windowing system (MATE).

My system now has a GTX 1050TI and an RTX 2080TI, with the 1050TI running the graphics. The code now crashes when trying to use both GPUs (or just the 1050TI).

On another machine, I have a single RTX 2070 that runs the graphics. Running the code on that single GPU also crashes in the same manner. If I disable my windowing system (server mode), the code runs fine. (Note that CUDA-aware MPI is still being used even with 1 GPU due to a periodic domain seam.)

The only common denominator I can see is that using CUDA-aware MPI on a GPU that is also running graphics does not seem to work when the GPU is Pascal or newer (since it DID work with the 750TI).

The crashes happen shortly into the run, but not right away; it takes a random number of steps before the crash occurs.

On the system with the single RTX 2070 running graphics, all the CUDA 10.1 sample programs ran fine, including the multi-GPU tests. This leads me to think it is an OpenMPI or PGI issue.

All systems were running Linux Mint 19.

The crash spits out:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

I know it is not necessarily common to run computation and graphics at the same time but it is useful for testing.

I think the POT3D code I previously sent you could reproduce this problem by switching CUDA-aware MPI on and off.

Hi Ron,

So this one is going to take a bit of digging and time.

Officially, PGI only supports the Tesla line of NVIDIA GPUs. Often, the GeForce line will also work since it shares the same architecture and CUDA drivers as the Tesla line, but we don’t test every GeForce card.

We do test on GTX1070 and GTX1080 cards (no RTXs, though), but none are used with display mode enabled. They are just used for compute. I’ve asked my IT folks to set something up for me that I can use to test. It might be a few days.

I’ll do my best to recreate the error and if possible determine the cause.

-Mat

Hi Ron,

My IT folks were able to set up a GTX1080 system for me with the display enabled. However, for good or bad, I’m not able to reproduce the error. I’ve tried PGI 18.10 with OpenMPI 2.1.2 and PGI 19.1 with OpenMPI 3.1.3, and both run correctly with CUDA-aware MPI. I also ran with 4 ranks, all using the same GPU, but the code ran fine.

My system has CUDA 10 installed.

Unfortunately, you’re going to have to dig into this more and determine what the issue is, or at least which component is causing the problem.

-Mat

Hi,

Thanks for looking into this.

The code I am running uses CUDA-aware MPI on a derived type member array (such as a%x), although “a%x” is passed to the subroutine instead of “a” due to the (still not fixed …) bug in PGI.
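For illustration, the workaround looks roughly like this (a simplified sketch with made-up names, not the actual code):

! Hypothetical sketch: the member array a%x is passed to the MPI
! routine rather than the parent derived type "a".
module field_mod
  implicit none
  type :: field_t
    real(8), allocatable :: x(:)
  end type field_t
contains
  subroutine seam_exchange(x, n, comm)
    use mpi
    integer, intent(in)    :: n, comm
    real(8), intent(inout) :: x(n)   ! assumed present on the device
    integer :: ierr
    !$acc host_data use_device(x)
    ! Periodic seam: with a single rank this sends to / receives from
    ! itself (rank 0), so CUDA-aware MPI is exercised even on 1 GPU.
    call MPI_Sendrecv_replace(x, n, MPI_REAL8, 0, 0, 0, 0, comm, &
                              MPI_STATUS_IGNORE, ierr)
    !$acc end host_data
  end subroutine seam_exchange
end module field_mod

! The caller passes the member array, not the derived type:
!   type(field_t) :: a
!   call seam_exchange(a%x, size(a%x), MPI_COMM_WORLD)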

Could this have something to do with it?

P.S. Any updates on that bug? I have not tried testing it on 19.1 yet because the changelog doesn’t seem to mention any fix for it.

-Ron

“Could this have something to do with it?”

Not sure, but I did run with CUDA Aware MPI enabled.

TPR#26191, which you reported in “Code crashing with 18.7 - worked with 18.4”, should be fixed in PGI 19.1. Though that one had to do with us changing the default to -Mallocatable=03.

TPR#25243, which is what I think you’re referring to, still isn’t fixed. I added another plea in the comments. I’ll try to push harder, but there’s only so much I can do. Definitely pester Michael at GTC since the bug is currently assigned to him.

-Mat