alloc of pinned memory has to be _after_ setting device

I have been experimenting with changing devices in an MPI-parallelized CUDA Fortran code. For some time I was running into a seg fault when trying to transfer certain arrays to the GPU after changing the device.

It turns out that not only does all device memory have to be reallocated (which is logical, since we are clearing the GPU), but pinned memory has to be reallocated as well. Is that the expected behavior?

The seg fault happens when accessing the pinned data in any way, whether copying it to the device or reading it on the host side.

The array is still marked as allocated, though, and maintains its shape.

I believe the correct behavior would be either for the pinned array to be marked as unallocated or for the data to still be available.

I tested with version 10.8 of the compiler. My workaround is to select the device at the very beginning of the program, but it would be nice to have a consistent data state (i.e. either unaffected by the device reset or automatically deallocated).

For illustration, the following program works fine:

PROGRAM test_set_device
  USE cudafor
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate( x(10))
  allocate(gx(10))
  print *, allocated(x), shape(x)
  gx = x
END

while this one seg faults at the “gx = x” line:

PROGRAM test_set_device
  USE cudafor
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  allocate( x(10))
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate(gx(10))
  print *, allocated(x), shape(x)
  gx = x
END

and this one seg faults at the “y = x(1)” line:

PROGRAM test_set_device
  USE cudafor
  real :: y
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  allocate( x(10))
  x(1) = 1
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate(gx(10))
  print *, allocated(x), shape(x)
  y = x(1)
END

Hi TroelsH,

While pinned memory lives on the host side, the CUDA driver manages this data. When you destroy your context via the cudaThreadExit call, the CUDA driver also destroys this data. Hence, this behavior is expected.

The simple workaround is to not use pinned memory here. Yes, you will lose some performance, but x’s data will be managed by the host and will not be destroyed when you change contexts.
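
If pinned transfers are still needed, another option is to release the pinned buffer before the context reset and re-allocate it afterwards, so that it is registered against the new context (a minimal sketch based on your second example; the buffer’s contents are lost either way):

PROGRAM test_set_device
  USE cudafor
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  allocate(x(10))
  x = 1.0
  ! release the pinned buffer before the context that registered it is destroyed
  deallocate(x)
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ! re-allocate pinned and device memory against the new context
  allocate(x(10))
  allocate(gx(10))
  x = 1.0
  gx = x
END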

A question for you: why are you calling cudaThreadExit? This will destroy all created contexts. Are you trying to use OpenMP to utilize multiple GPUs and want to share ‘x’ across these GPUs? If so, try setting the devices in parallel first, before allocating any data.

For example:

% cat test.cuf
PROGRAM test_set_device
  USE cudafor
  use omp_lib
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr, tnum

! Create your device context in parallel
!$omp parallel private(tnum)
  tnum = omp_get_thread_num()
  print *, 'TNUM:', tnum
  ierr = cudaSetDevice(tnum); if (ierr > 0) print *,cudaGetErrorString(ierr)
!$omp end parallel

! Perform initialization
  allocate(x(10))
  x= 10.1

! Execute the main problem in parallel
!$omp parallel private(tnum)
  tnum = omp_get_thread_num()
  allocate(gx(10))
  gx = x
  print *, tnum, allocated(x), shape(x), x
!$omp end parallel

END

% pgf90 -fast test.cuf -o test.out -V10.8 -mp
% setenv OMP_NUM_THREADS 4
% test.out
 TNUM:            0
 TNUM:            3
 TNUM:            2
 TNUM:            1
            0  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            3  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            2  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            1  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
%

Hope this helps,
Mat

Hi Mat,

Thanks for your answer. It makes good sense.

We have 4 GPUs per node and are using MPI for the parallelization. I use cudaThreadExit + cudaSetDevice to make sure that each MPI rank gets the correct (and unique!) GPU device.
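
For reference, the per-rank device selection at the start of the program looks roughly like this (a minimal sketch; the mod-based mapping of rank to device is an assumption and only works if ranks are numbered consecutively within each node):

PROGRAM mpi_device_select
  USE cudafor
  USE mpi
  real, pinned, allocatable, dimension(:) :: x
  integer :: ierr, rank, ndev

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! bind this rank to a GPU before any pinned or device allocation,
  ! so the pinned buffer is registered with the context we keep
  ierr = cudaGetDeviceCount(ndev)
  ierr = cudaSetDevice(mod(rank, ndev)); if (ierr > 0) print *,cudaGetErrorString(ierr)

  allocate(x(10))
  x = real(rank)

  call MPI_Finalize(ierr)
END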

I still think that if the CUDA driver destroys the pinned arrays, they should be marked as deallocated by CUDA Fortran, which is not the case now.

best,

Troels

“they should be marked as deallocated by CUDA Fortran, which is not the case now.”

I agree and have added a feature request (TPR#17189) to perform garbage collection of device and pinned memory after a call to cudaThreadExit is made.

Thanks,
Mat