Unified memory - more than 1 GPU


Is there a plan to add support for unified memory with more than 1 GPU?

Hi hhward,

CUDA Unified Memory support on more than one GPU has been available for quite some time.

See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-multi-gpu


Thanks for your reply.

Is it supported with openacc too?

Is it supported with openacc too?

Yes, PGI’s implementation of OpenACC does support CUDA Unified Memory. You can enable it via the compiler flag “-ta=tesla:managed”.

For details please see: https://www.pgroup.com/resources/docs/18.10/x86/pgi-user-guide/index.htm#acc-mem-unified
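As an illustration, here is a minimal sketch of an OpenACC kernel that relies on managed memory (the program name, array names, and sizes are made up for this example):

```fortran
! Minimal sketch: with -ta=tesla:managed, allocatable arrays live in
! CUDA Unified Memory and migrate between host and device on demand.
program managed_demo
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real, allocatable :: a(:), b(:)

   allocate(a(n), b(n))
   b = 1.0

   ! No copyin/copyout clauses: the runtime pages data in as needed
   !$acc parallel loop
   do i = 1, n
      a(i) = 2.0 * b(i)
   end do

   print *, 'a(1) =', a(1)
end program managed_demo
```

Compiled with something like "pgfortran -ta=tesla:managed managed_demo.f90", the loop runs on the GPU without any explicit data directives.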

Hope this helps,

Thank you for your reply.

I have tested this, but it seems that only one GPU is being used.

I have a test case with 17.7 GB of data on a machine with four P100s (16 GB per card). The memory usage (checked with nvidia-smi) shows that only one of the cards is in use.

pgfortran test.f90 -o xacc -mp=allcores -ta=tesla:managed

is used to compile.

It looks like all calculations are done on one card.

Do you have any idea what’s wrong?

How are you assigning the OpenMP threads to the GPU devices?

Are the OpenACC regions within the OpenMP parallel regions?

To set the device number, you’ll want something like this early in the code:

     use omp_lib
     use openacc

     integer :: devNum, thid, dev

     ! query how many devices of the default type are available
     devNum = acc_get_num_devices(acc_get_device_type())
!$omp parallel private(thid,dev)
     thid = omp_get_thread_num()
     ! round-robin: map each OpenMP thread to a device
     dev = mod(thid,devNum)
     call acc_set_device_num(dev, acc_get_device_type())
     call acc_init(acc_get_device_type())
!$omp end parallel

The OpenMP threads retain the same device for subsequent parallel regions.
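If it’s unclear whether the mapping took effect, a quick diagnostic sketch (assuming the same omp_lib and openacc modules are in scope) is to have each thread report the device it is bound to:

```fortran
! Diagnostic sketch: each OpenMP thread prints the OpenACC device
! number it is currently using.
!$omp parallel private(thid)
   thid = omp_get_thread_num()
   print *, 'OpenMP thread', thid, 'is using device', &
            acc_get_device_num(acc_get_device_type())
!$omp end parallel
```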

! run parallel host threads
!$omp parallel do private(j)
do i=1,N
   ! offload the inner loop to each thread's device
   !$acc parallel loop
   do j=1,M
      ! ... loop body ...
   end do
end do
!$omp end parallel do
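Putting the device assignment and the nested loops together, a hedged end-to-end sketch might look like the following (N, M, and the array x are illustrative, not taken from your test case):

```fortran
program multi_gpu_demo
   use omp_lib
   use openacc
   implicit none
   integer, parameter :: N = 4, M = 1000000
   integer :: i, j, thid, dev, devNum
   real, allocatable :: x(:,:)

   allocate(x(M,N))

   ! bind each OpenMP thread to a GPU, round-robin
   devNum = acc_get_num_devices(acc_get_device_type())
!$omp parallel private(thid,dev)
   thid = omp_get_thread_num()
   dev  = mod(thid, devNum)
   call acc_set_device_num(dev, acc_get_device_type())
   call acc_init(acc_get_device_type())
!$omp end parallel

   ! outer loop across host threads; each inner loop is
   ! offloaded to the device its thread was bound to above
!$omp parallel do private(j)
   do i = 1, N
      !$acc parallel loop
      do j = 1, M
         x(j,i) = real(i) + real(j)
      end do
   end do
!$omp end parallel do

   print *, x(1,1), x(M,N)
end program multi_gpu_demo
```

With -ta=tesla:managed, the allocatable array x is placed in unified memory, so each GPU can touch its slice without explicit data clauses.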

Hope this helps,