alloc of pinned memory has to be _after_ setting device

I have been experimenting with changing devices in an MPI-parallelized CUDA Fortran code. For some time I was running into a seg fault when trying to transfer certain arrays to the GPU after changing the device.

It turns out that not only does all device memory have to be reallocated (which is logical, since we are clearing the GPU), but pinned memory has to be reallocated as well. Is that the expected behavior?

The seg fault happens when accessing the pinned data in any way, whether copying it to the device or reading it on the host side.

The array is still marked as allocated, though, and maintains its shape.

I believe the correct behavior would be either for the pinned array to be marked as unallocated or for the data to still be available.

I tested with version 10.8 of the compiler. My workaround is to select the device at the very beginning of the program, but it would be nice to have a consistent data state (i.e. either unaffected by the device reset or automatically deallocated).

For illustration, the following program works fine:

PROGRAM test_set_device
  USE cudafor
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate( x(10))
  allocate(gx(10))
  print *, allocated(x), shape(x)
  gx = x
END

while this one seg faults at the “gx = x” line:

PROGRAM test_set_device
  USE cudafor
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  allocate( x(10))
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate(gx(10))
  print *, allocated(x), shape(x)
  gx = x
END

and this one seg faults at the “y = x(1)” line:

PROGRAM test_set_device
  USE cudafor
  real :: y
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  allocate( x(10))
  x(1) = 1
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate(gx(10))
  print *, allocated(x), shape(x)
  y = x(1)
END

Hi TroelsH,

While pinned memory lives on the host side, the CUDA driver manages this data. When you destroy your context via the cudaThreadExit call, the CUDA driver also destroys this data. Hence, this behavior is expected.

The simple workaround is to not use pinned memory here. Yes, you will lose some performance, but x’s data will be managed by the host and will not be destroyed when you change contexts.
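
If pinned transfers are still needed, another option is to release the pinned buffer before the context reset and re-allocate it afterwards, so that it is registered against the new context (a minimal sketch based on your second example; the buffer’s contents are lost either way):

PROGRAM test_set_device
  USE cudafor
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  allocate(x(10))
  x = 1.0
  ! release the pinned buffer before the context that registered it is destroyed
  deallocate(x)
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ! re-allocate pinned and device memory against the new context
  allocate(x(10))
  allocate(gx(10))
  x = 1.0
  gx = x
END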

A question for you: why are you calling cudaThreadExit? This will destroy all created contexts. Are you trying to use OpenMP to utilize multiple GPUs and want to share ‘x’ across these GPUs? If so, try setting the devices in parallel first, before allocating any data.

For example:

% cat test.cuf
PROGRAM test_set_device
  USE cudafor
  use omp_lib
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr, tnum

! Create your device context in parallel
!$omp parallel private(tnum)
  tnum = omp_get_thread_num()
  print *, 'TNUM:', tnum
  ierr = cudaSetDevice(tnum); if (ierr > 0) print *,cudaGetErrorString(ierr)
!$omp end parallel

! Perform initialization
  allocate(x(10))
  x= 10.1

! Execute the main problem in parallel
!$omp parallel private(tnum)
  tnum = omp_get_thread_num()
  allocate(gx(10))
  gx = x
  print *, tnum, allocated(x), shape(x), x
!$omp end parallel

END

% pgf90 -fast test.cuf -o test.out -V10.8 -mp
% setenv OMP_NUM_THREADS 4
% test.out
 TNUM:            0
 TNUM:            3
 TNUM:            2
 TNUM:            1
            0  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            3  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            2  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            1  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
%

Hope this helps,
Mat

Hi Mat,

Thanks for your answer. It makes good sense.

We have 4 GPUs per node and are using MPI for the parallelization. I use cudaThreadExit + cudaSetDevice to make sure that each MPI rank gets the correct (and unique!) GPU device.
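
For reference, the per-rank device selection at the start of the program looks roughly like this (a minimal sketch; the mod-based mapping of rank to device is an assumption and only works if ranks are numbered consecutively within each node):

PROGRAM mpi_device_select
  USE cudafor
  USE mpi
  real, pinned, allocatable, dimension(:) :: x
  integer :: ierr, rank, ndev

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! bind this rank to a GPU before any pinned or device allocation,
  ! so the pinned buffer is registered with the context we keep
  ierr = cudaGetDeviceCount(ndev)
  ierr = cudaSetDevice(mod(rank, ndev)); if (ierr > 0) print *,cudaGetErrorString(ierr)

  allocate(x(10))
  x = real(rank)

  call MPI_Finalize(ierr)
END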

I still think that if the CUDA driver destroys the pinned arrays, they should be marked as deallocated by CUDA Fortran, which is not the case now.

best,

Troels

“they should be marked as deallocated by CUDA Fortran, which is not the case now.”

I agree and have added a feature request (TPR#17189) to perform garbage collection of device and pinned memory after a call to cudaThreadExit is made.

Thanks,
Mat