pgf90 + openacc & managed memory / um-evaluation package

Hello, I’ve downloaded the pgi/15.4 + managed/um-evaluation package.

Some libraries/link flags are missing from the pgf90 link step when using -ta=tesla:managed.

pgf90 -ta=host,tesla:managed hello_openacc.f90 -o hello_openacc_managed
/tmp/pgf90KZJSN8umD_5.o: In function `hello':
/home/escj/dir_OPENACC/./hello_openacc.f90:3: undefined reference to `pgf90_man_alloc04'
/home/escj/dir_OPENACC/./hello_openacc.f90:7: undefined reference to `__pgi_uacc_cuda_launchk2'
/tmp/pgf90KZJSN8umD_5.o: In function `...cuda_fortran_constructor_1':
/home/escj/dir_OPENACC/./hello_openacc.f90:15: undefined reference to `__pgi_cuda_register_fat_binary'
/home/escj/dir_OPENACC/./hello_openacc.f90:15: undefined reference to `__cudaRegisterFunction'
pgacclnk: child process exit status 1: /usr/bin/ld

Adding -Mcuda solves the problem:

pgf90 -Mcuda -ta=host,tesla:managed hello_openacc.f90 -o hello_openacc_managed

After this, it works correctly
(but performance is very poor, for example in a STREAM benchmark; as I understand it, data is copied in/out at every kernel launch).
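
For reference, hello_openacc.f90 is just a tiny OpenACC kernel on an allocatable array; a minimal sketch along those lines (a reconstruction, so the line numbers in the linker messages above won't match exactly) is:

program hello_openacc
  implicit none
  integer, parameter :: n = 1024
  integer :: i
  real, allocatable :: a(:)     ! allocatable data becomes CUDA managed with -ta=tesla:managed
  allocate(a(n))
  !$acc kernels
  do i = 1, n
     a(i) = real(i)
  end do
  !$acc end kernels
  print *, 'a(n) =', a(n)
  deallocate(a)
end program hello_openacc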

Bye

Juan

Hi Juan,

Yes, to use Unified Memory with Fortran, you do need to add the “-Mcuda”.

but performance is very poor, for example in a STREAM benchmark; as I understand it, data is copied in/out at every kernel launch

Not unexpected, since UM is meant to help with the initial porting effort. However, the data only gets copied when it’s been modified. So if you’re touching the data on both the host and the device, then yes, it will get copied back and forth with each kernel launch. If you only touch the data on either the host or the device, it doesn’t get copied.
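
As an illustration (a hypothetical sketch, not your STREAM code): with -ta=tesla:managed, an allocatable array only migrates when the other side actually touches it.

program um_touch
  implicit none
  integer, parameter :: n = 1024*1024
  integer :: it, i
  real, allocatable :: a(:)
  allocate(a(n))
  a = 0.0                        ! first touch on the host

  do it = 1, 10
     !$acc kernels               ! device touches 'a'; pages migrate to the GPU
     do i = 1, n
        a(i) = a(i) + 1.0
     end do
     !$acc end kernels
     ! If the host also touched 'a' here (e.g. print *, a(1)), the pages
     ! would migrate back, and forth again on the next kernel launch.
  end do

  print *, a(1)                  ! a single migration back to the host at the end
end program um_touch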

  • Mat

Hello Mat.

  1. ;-) The -Mcuda option was not indicated in the PGI news announcement
    (nor in the README coming with the um-eval package).

  2. ;-) I found the cause of the poor performance of my STREAM benchmark …

I have 2 TITAN cards on my test machine!

pgaccelinfo -short
0 GeForce GTX TITAN
1 GeForce GTX TITAN

Setting CUDA_VISIBLE_DEVICES=0 speeds the performance up to near optimal.

Without CUDA_VISIBLE_DEVICES

stream_pgi154_acc_ompi175_manage
...
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:        17502.01565974     ...
Scale:       17508.98335551     ...
Add:         11480.43242445     ...
Triad:       11383.99349557     ...

With CUDA_VISIBLE_DEVICES=0

CUDA_VISIBLE_DEVICES=0 stream_pgi154_acc_ompi175_manage
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:        222708.45205104 
Scale:       221631.04772747
Add:         222303.20011646
Triad:       222423.12092569

So, good for a first try!

Bye

Juan

One question left:

With GPUDirect-aware MPI, does !$acc host_data use_device
work for managed arrays, and must it be used?

Otherwise, how does the system/library know which of the host/device versions of the array must be transferred via MPI?

Bye Juan

Hi Juan,

I’m not sure. I wouldn’t think using “host_data” is needed, but I question whether CUDA UM works with GPU Direct.

Let me ask some folks and get back to you.

  • Mat

Hi Juan,

Here’s the response I got:

If managed memory is used, host_data use_device is not necessary (it would not change the value of the pointer anyway). However, the MPI implementation needs to be CUDA-aware and aware of unified memory. The CUDA-aware builds of OpenMPI 1.8.5, which was released a couple of days ago, are aware of unified memory. If he needs to run with another or an older CUDA-aware MPI implementation, he needs to make sure that CUDA IPC (GPUDirect P2P) and GPUDirect RDMA are disabled; e.g., for OpenMPI this can be achieved with:

mpiexec --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0

(Explicitly disabling GPUDirect RDMA is not necessary with OpenMPI because it is off by default).
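
For example, prepended to a normal two-rank launch (the executable name here is just a placeholder):

mpiexec --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 -np 2 ./my_managed_app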

Hope this helps,
Mat

Hello Mat.

With the latest pgi/15.5, I have tested -ta=tesla:managed (managed memory) with CUDA-aware openmpi 1.8.5 (and also with mvapich2-1.a-gdr).

Apparently, activating the managed flag doesn’t inhibit the host_data clause, which then generates a segmentation fault.

Here is a hellompi_managed.f90 example:

=> Process 0 sends 1.0 to process 1 on the GPU. Process 1 prints the result on the CPU.

program hello_managed

  implicit none
  include 'mpif.h'

  integer, parameter :: n=256
  real, allocatable, dimension(:) :: send_buf,recv_buf
  integer :: npe,mype,ierr

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, mype, ierr)
  call mpi_comm_size(mpi_comm_world, npe, ierr)

  if (npe.ne.2) STOP 'run with 2 MPI tasks only'

  allocate( send_buf(n),recv_buf(n) )

  !$acc data create(send_buf,recv_buf)

  !$acc kernels
  send_buf=1.0
  recv_buf=-999.0
  !$acc end kernels

  !$acc host_data use_device(send_buf,recv_buf)
  if ( mype .eq. 0 ) then
     call MPI_Send(send_buf,n,MPI_REAL,1,0,mpi_comm_world,ierr)
  else
     call MPI_Recv(recv_buf,n,MPI_REAL,0,0,mpi_comm_world,mpi_status_ignore,ierr)
  endif
  !$acc end host_data

  if ( mype .eq. 1 ) then
     !$acc update host(recv_buf(n:n))
     print*,'mype=',mype,' recv_buf(n) <must be 1> =',recv_buf(n)
  end if

  !$acc end data

  call mpi_finalize(ierr)

end program hello_managed

=> Compiled without the managed flag, it works correctly, thanks to the CUDA-aware MPI:

mpif90 -ta=tesla:cuda6.5 hellompi_managed.f90 -o hellompi_tesla
mpirun -np 2 hellompi_tesla
mype= 1 recv_buf(n) <must be 1> = 1.000000

=> With the managed flag, the code crashes:

mpif90 -Mcuda -ta=tesla:cuda6.5,managed hellompi_managed.f90 -o hellompi_managed
mpirun -np 2 hellompi_managed

mpirun noticed that process rank 0 with PID 6225 on node n370 exited on signal 11 (Segmentation fault).

=> Removing the host_data makes the code run correctly with the managed flag (see the snippet after the output below):

mpif90 -Mcuda -ta=tesla:cuda6.5,managed hellompi_managed.f90 -o hellompi_managed_no_hostdata
mpirun -np 2 hellompi_managed_no_hostdata
mype= 1 recv_buf(n) <must be 1> = 1.000000
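
For clarity, “removing the host_data” just means passing the managed arrays to MPI directly; the send/receive section of the example above then reads (sketch of the modification):

! no host_data use_device region: with managed memory the same pointer
! is valid on both host and device, so the buffers are passed directly
if ( mype .eq. 0 ) then
   call MPI_Send(send_buf,n,MPI_REAL,1,0,mpi_comm_world,ierr)
else
   call MPI_Recv(recv_buf,n,MPI_REAL,0,0,mpi_comm_world,mpi_status_ignore,ierr)
endif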

=> One last remark: the update clause is not deactivated (whereas the create clauses are), as shown by the -Minfo=acc output:

mpif90 -Mcuda -ta=tesla:cuda6.5,managed -Minfo=acc hellompi_managed.f90 -o hellompi_managed
hello_managed:
21, Loop is parallelizable
Accelerator kernel generated
21, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
34, Generating update host(recv_buf(256))

Bye, Juan.

Thanks Juan.

I know in the 15.1 compiler we had a similar issue with host_data, but I thought we had fixed it in 15.2. Let me try to reproduce the problem here. It may be a different issue, or we missed a case.

  • Mat

Hi Juan,

I was able to recreate the issue when using OpenMPI 1.8.5 built with RDMA and CUDA 6.5. Interestingly, when I switched to using MPICH2 gdr with CUDA 7.0, the test passed.

I’m still tracking things down, but it could be a CUDA 6.5 or PGI issue. Not sure which. My next step is to build OpenMPI with CUDA 7.0 and see what happens.

  • Mat