mpi + pgi directives question

Hi,

Two questions:

1. If I were to have part of an MPI code using CUDA and other parts using PGI directives, is this going to cause problems when I try to assign GPUs to an MPI process? For example, in the “5x in 5 Hours” article (Account Login | PGI), in the “set up code” section, would this assign GPUs to each process just fine for both the CUDA and the directives portions of the code?

2. With regard to the set up code mentioned above: when I add that code, I add a call to setDevice like so:
    CALL MPI_INIT(ierr)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, npp, ierr)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
    nproc = npp
    IDPROC = me

    devnum = setDevice(nproc, IDPROC)
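
(For reference, setDevice here is the GPU-assignment helper from the article’s set up code. A rough sketch of what that kind of routine does, not the article’s exact code, assuming the PGI accel_lib API, 0-based NVIDIA device numbers, and skipping the article’s per-node rank grouping:)

    ! Sketch only, not the article's exact code: bind each MPI rank to a GPU
    ! using the PGI accel_lib routines. Assumes at least one NVIDIA device.
    integer function setDevice(nprocs, myrank)
      use accel_lib
      implicit none
      integer, intent(in) :: nprocs, myrank
      integer :: numdev, dev

      numdev = acc_get_num_devices(acc_device_nvidia)  ! GPUs visible to this rank
      dev    = mod(myrank, numdev)                     ! simple round-robin mapping
      call acc_set_device_num(dev, acc_device_nvidia)  ! attach this rank to that GPU
      setDevice = dev
    end function setDevice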

Then, when I add a region like this:
!$acc region
!$acc do private(rhoy,rhox)
loop
!$acc end region

I get the runtime error
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 5000
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 5000

mpirun has exited due to process rank 5 with PID 4764 on
node dirac47-ib exiting without calling “finalize”. This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

which I suspect is due to an error in the “set up” code I inserted. Are there any common problems that I might be running into here?

Thanks,
Ben

Hi Ben,

If I were to have part of an MPI code using CUDA and other parts using PGI directives, is this going to cause problems when I try to assign GPUs to an MPI process? For example, in the “5x in 5 Hours” article (Account Login | PGI), in the “set up code” section, would this assign GPUs to each process just fine for both the CUDA and the directives portions of the code?

This should work, but with all the changes to our 2013 run time and the new CUDA versions, it is having some issues. The problem is that after you call cudaSetDevice in the CUDA C portion of the code, the device isn’t getting initialized. Our engineer asked me to have you try adding any CUDA call (like cudaMalloc) after the call to cudaSetDevice, to get the CUDA run time to initialize the device.
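
Something along these lines, just a sketch of the idea (shown in CUDA Fortran for brevity; in CUDA C it is the same pattern of cudaSetDevice followed by any small allocation, and the routine name here is only illustrative):

    ! Sketch: force the CUDA run time to create a context on the selected
    ! device by making a CUDA call right after cudaSetDevice.
    subroutine initAssignedDevice(devnum)
      use cudafor
      implicit none
      integer, intent(in) :: devnum
      integer :: istat
      real, device, allocatable :: d_dummy(:)

      istat = cudaSetDevice(devnum)   ! select the device for this rank
      allocate(d_dummy(1))            ! dummy device allocation (a cudaMalloc under the hood)
      deallocate(d_dummy)             ! free it again; the context remains initialized
    end subroutine initAssignedDevice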

which I suspect is due to an error in the “set up” code I inserted. Are there any common problems that I might be running into here?

Possible, but it could be due to other reasons as well. Try the above workaround and see if it fixes the problem.

  • Mat

Thanks, Mat. I actually haven’t implemented CUDA and the directives together, but I was considering doing so and was just wondering what complications might occur in the process.

What I’m currently playing with is just a Fortran MPI code, and I’m only trying to add directives at the moment. Does the location where setDevice is called matter, as long as it’s before the first accelerator region (and not sitting in a loop or something)? Right now I just have it sitting in the subroutine with mpi_init. The accelerator regions are in a different subroutine, but I figure this doesn’t matter.

Ben

Hi Ben,

Does the location where setDevice is called matter, as long as it’s before the first accelerator region (and not sitting in a loop or something)?

It should be fine there (unless you’re using an old compiler like pre-10.6).

call to cuMemcpyDtoH returned error 700: Launch failed

This typically means that the kernel before the memcpy failed for some reason. Does the code run correctly without the directives enabled? (Be sure to guard the setDevice call with the _OPENACC or _ACCEL macro.)
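
For example, something like this (a sketch, assuming the source file is run through the preprocessor, e.g. a .F90 suffix or -Mpreprocess):

#if defined(_OPENACC) || defined(_ACCEL)
      devnum = setDevice(nproc, IDPROC)
#endif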

  • Mat

It runs correctly without directives enabled.

When I run the code with the directives but remove the private clause, I don’t get the cuMemcpyDtoH error; instead, the code hangs/gets stuck at the same place where I would have gotten that error.

I am compiling with 12.3, but when I try to compile with 12.9 I get:

PGF90-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected flow graph (gem.f90: 503)

The code runs correctly after compiling with 12.9 with directives, but I don’t know if it’s actually being accelerated much because of the above message. The rest of the accelerator info from the compiler is:

ppush:
504, Accelerator scalar kernel generated
505, Loop is parallelizable
583, Loop is parallelizable
654, Sum reduction generated for mynopi


Ben

I’ll need to see a reproducing example. Can you send one to PGI Customer Service (trs@pgroup.com) and ask them to send it to me?

Thanks,
Mat

Hi Mat, I sent trs a smaller program that exhibits similar problems and that hopefully won’t be too much for you to look at.

This TPR has been corrected in the 13.5 release, out now.

regards,
dave

Hi Mat,

I have a fairly large MPI code that uses dfftpack and LAPACK, which we normally run on a NERSC computer. I am having some problems getting it to work with OpenACC. Is this code something you have the ability and time to look at? I understand that compiling and/or running an unfamiliar MPI code can be a huge hassle.

Ben

Hi Ben,

I’m actually giving a training session at NERSC on Thursday, so I just got a login to Carver/Dirac. Is there an easy way for users to share files? If I could access your existing build tree, that might be the easiest thing to do.

Otherwise, I’m fine with helping on larger codes, provided you can walk me through what I need to do and have a well-defined problem. Internally, I do have a medium-size GPU cluster to work on, so I can recreate larger problems here.

  • Mat