pgf90 + openacc & managed memory / um-evaluation package

Hello, I’ve downloaded the pgi/15.4 + managed/um-evaluation package.

Some libraries/link flags are missing from the pgf90 link step when using -ta=tesla:managed.

pgf90 -ta=host,tesla:managed hello_openacc.f90 -o hello_openacc_managed
/tmp/pgf90KZJSN8umD_5.o: In function `hello':
/home/escj/dir_OPENACC/./hello_openacc.f90:3: undefined reference to `pgf90_man_alloc04'
/home/escj/dir_OPENACC/./hello_openacc.f90:7: undefined reference to `__pgi_uacc_cuda_launchk2'
/tmp/pgf90KZJSN8umD_5.o: In function `...cuda_fortran_constructor_1':
/home/escj/dir_OPENACC/./hello_openacc.f90:15: undefined reference to `__pgi_cuda_register_fat_binary'
/home/escj/dir_OPENACC/./hello_openacc.f90:15: undefined reference to `__cudaRegisterFunction'
pgacclnk: child process exit status 1: /usr/bin/ld

Adding -Mcuda solves the problem:

pgf90 -Mcuda -ta=host,tesla:managed hello_openacc.f90 -o hello_openacc_managed

After this, it works correctly
(but performance is very poor, for example in a STREAM benchmark; as I understand it, data is copied in/out at every kernel launch).
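
For reference, hello_openacc.f90 is just a tiny OpenACC kernel on an allocatable array; a minimal sketch along those lines (a reconstruction, so the line numbers in the linker messages above won't match exactly) is:

program hello_openacc
  implicit none
  integer, parameter :: n = 1024
  integer :: i
  real, allocatable :: a(:)     ! allocatable data becomes CUDA managed with -ta=tesla:managed
  allocate(a(n))
  !$acc kernels
  do i = 1, n
     a(i) = real(i)
  end do
  !$acc end kernels
  print *, 'a(n) =', a(n)
  deallocate(a)
end program hello_openacc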

Bye

Juan

Hi Juan,

Yes, to use Unified Memory with Fortran, you do need to add the “-Mcuda”.

but performance is very poor, for example in a STREAM benchmark; as I understand it, data is copied in/out at every kernel launch

Not unexpected, since UM is meant to help with the initial porting effort. However, the data only gets copied when it’s been modified. So if you’re touching the data on both the host and the device, then yes, it will get copied back and forth with each kernel launch. If you only touch the data on either the host or the device, it doesn’t get copied.
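
As an illustration (a hypothetical sketch, not your STREAM code): with -ta=tesla:managed, an allocatable array only migrates when the other side actually touches it.

program um_touch
  implicit none
  integer, parameter :: n = 1024*1024
  integer :: it, i
  real, allocatable :: a(:)
  allocate(a(n))
  a = 0.0                        ! first touch on the host

  do it = 1, 10
     !$acc kernels               ! device touches 'a'; pages migrate to the GPU
     do i = 1, n
        a(i) = a(i) + 1.0
     end do
     !$acc end kernels
     ! If the host also touched 'a' here (e.g. print *, a(1)), the pages
     ! would migrate back, and forth again on the next kernel launch.
  end do

  print *, a(1)                  ! a single migration back to the host at the end
end program um_touch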

  • Mat

Hello Mat.

  1. ;-) The -Mcuda option was not indicated in the PGI news announcement
    (nor in the README coming with the um-eval package).

  2. ;-) I found the cause of the poor performance of my STREAM benchmark …

I have 2 TITAN cards on my test machine!

pgaccelinfo -short
0 GeForce GTX TITAN
1 GeForce GTX TITAN

Setting CUDA_VISIBLE_DEVICES=0 speeds the performance up to near optimal.

Without CUDA_VISIBLE_DEVICES

stream_pgi154_acc_ompi175_manage
...
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:        17502.01565974     ...
Scale:       17508.98335551     ...
Add:         11480.43242445     ...
Triad:       11383.99349557     ...

With CUDA_VISIBLE_DEVICES=0

CUDA_VISIBLE_DEVICES=0 stream_pgi154_acc_ompi175_manage
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:        222708.45205104 
Scale:       221631.04772747
Add:         222303.20011646
Triad:       222423.12092569

So, good for a first try!

Bye

Juan

One question left:

With GPUDirect-aware MPI, does !$acc host_data use_device
work for managed arrays, and must it be used?

Otherwise, how does the system/library know which of the host/device versions of the array must be transferred via MPI?

Bye Juan

Hi Juan,

I’m not sure. I wouldn’t think using “host_data” is needed, but I question whether CUDA UM works with GPU Direct.

Let me ask some folks and get back to you.

  • Mat

Hi Juan,

Here’s the response I got:

If managed memory is used, host_data use_device is not necessary (it would not change the value of the pointer anyway). However, the MPI implementation needs to be CUDA-aware and aware of unified memory. The CUDA-aware builds of OpenMPI 1.8.5, which was released a couple of days ago, are aware of unified memory. If he needs to run with another or an older CUDA-aware MPI implementation, he needs to make sure that CUDA IPC (GPUDirect P2P) and GPUDirect RDMA are disabled; e.g., for OpenMPI this can be achieved with:

mpiexec --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0

(Explicitly disabling GPUDirect RDMA is not necessary with OpenMPI because it is off by default).
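
For example, prepended to a normal two-rank launch (the executable name here is just a placeholder):

mpiexec --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 -np 2 ./my_managed_app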

Hope this helps,
Mat

Hello Mat.

With the latest pgi/15.5, I have tested -ta=tesla:managed (managed memory) with CUDA-aware openmpi 1.8.5 (and also with mvapich2-1.a-gdr).

Apparently, activating the managed flag doesn’t inhibit the host_data clause, which then generates a segmentation fault.

Here is a hellompi_managed.f90 example:

=> Process 0 sends 1.0 to process 1 on the GPU. Process 1 prints the result on the CPU.

program hello_managed

  implicit none
  include 'mpif.h'

  integer, parameter :: n=256
  real, allocatable, dimension(:) :: send_buf,recv_buf
  integer :: npe,mype,ierr

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, mype, ierr)
  call mpi_comm_size(mpi_comm_world, npe, ierr)

  if (npe.ne.2) STOP 'run with 2 MPI tasks only'

  allocate( send_buf(n),recv_buf(n) )

  !$acc data create(send_buf,recv_buf)

  !$acc kernels
  send_buf=1.0
  recv_buf=-999.0
  !$acc end kernels

  !$acc host_data use_device(send_buf,recv_buf)
  if ( mype .eq. 0 ) then
     call MPI_Send(send_buf,n,MPI_REAL,1,0,mpi_comm_world,ierr)
  else
     call MPI_Recv(recv_buf,n,MPI_REAL,0,0,mpi_comm_world,mpi_status_ignore,ierr)
  endif
  !$acc end host_data

  if ( mype .eq. 1 ) then
     !$acc update host(recv_buf(n:n))
     print*,'mype=',mype,' recv_buf(n) <must be 1> =',recv_buf(n)
  end if

  !$acc end data

  call mpi_finalize(ierr)

end program hello_managed

=> Compiled without the managed flag, it works correctly, thanks to the CUDA-aware MPI:

mpif90 -ta=tesla:cuda6.5 hellompi_managed.f90 -o hellompi_tesla
mpirun -np 2 hellompi_tesla
mype= 1 recv_buf(n) <must be 1> = 1.000000

=> With the managed flag, the code crashes:

mpif90 -Mcuda -ta=tesla:cuda6.5,managed hellompi_managed.f90 -o hellompi_managed
mpirun -np 2 hellompi_managed

mpirun noticed that process rank 0 with PID 6225 on node n370 exited on signal 11 (Segmentation fault).

=> Removing the host_data makes the code run correctly with the managed flag (see the snippet after the output below):

mpif90 -Mcuda -ta=tesla:cuda6.5,managed hellompi_managed.f90 -o hellompi_managed_no_hostdata
mpirun -np 2 hellompi_managed_no_hostdata
mype= 1 recv_buf(n) <must be 1> = 1.000000
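
For clarity, “removing the host_data” just means passing the managed arrays to MPI directly; the send/receive section of the example above then reads (sketch of the modification):

! no host_data use_device region: with managed memory the same pointer
! is valid on both host and device, so the buffers are passed directly
if ( mype .eq. 0 ) then
   call MPI_Send(send_buf,n,MPI_REAL,1,0,mpi_comm_world,ierr)
else
   call MPI_Recv(recv_buf,n,MPI_REAL,0,0,mpi_comm_world,mpi_status_ignore,ierr)
endif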

=> One last remark: the update clause is not deactivated (whereas the create clauses are), as shown by the -Minfo=acc output:

mpif90 -Mcuda -ta=tesla:cuda6.5,managed -Minfo=acc hellompi_managed.f90 -o hellompi_managed
hello_managed:
21, Loop is parallelizable
Accelerator kernel generated
21, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
34, Generating update host(recv_buf(256))

Bye, Juan.

Thanks Juan.

I know in the 15.1 compiler we had a similar issue with host_data, but I thought we had fixed it in 15.2. Let me try to reproduce the problem here. It may be a different issue, or we missed a case.

  • Mat

Hi Juan,

I was able to recreate the issue when using OpenMPI 1.8.5 built with RDMA and CUDA 6.5. Interestingly, when I switched to using MPICH2 gdr with CUDA 7.0, the test passed.

I’m still tracking things down, but it could be a CUDA 6.5 or PGI issue. Not sure which. My next step is to build OpenMPI with CUDA 7.0 and see what happens.

  • Mat