CUDA Fortran Book Memory Allocation Error

I am running mpiDevices.cuf from the book CUDA Fortran for Scientists and Engineers. I have 32 Intel processors and an NVIDIA GV100. My operating system is Ubuntu, and I am running PGI Community Edition version 18.10.

When I run the code with two processors, I get warnings and the following output:

mpiexec -np 2 mpiDevices.out

WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

Local host: BWA

[0] using device: 0 in compute mode: 0
[1] using device: 0 in compute mode: 0
[1] after allocation on rank: 0, device arrays allocated: 1
[0] after allocation on rank: 0, device arrays allocated: 1
[1] after allocation on rank: 1, device arrays allocated: 2
[0] after allocation on rank: 1, device arrays allocated: 2
Test Passed
Test Passed
[BWA:11385] 1 more process has sent help message help-btl-vader.txt / cma-permission-denied
[BWA:11385] Set MCA parameter “orte_base_help_aggregate” to 0 to see all help / error messages

The output is correct, but I get a warning. What does the warning about CMA memory mean? When I run the code with 3 processors, the output is incorrect and the warning persists:

mpiexec -np 3 mpiDevices.out

WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

Local host: BWA

[0] using device: 0 in compute mode: 0
[1] using device: 0 in compute mode: 0
[2] using device: 0 in compute mode: 0
[0] after allocation on rank: 0, device arrays allocated: 1
[2] after allocation on rank: 0, device arrays allocated: 1
[1] after allocation on rank: 0, device arrays allocated: 2
[2] after allocation on rank: 1, device arrays allocated: 2
[0] after allocation on rank: 1, device arrays allocated: 2
[1] after allocation on rank: 1, device arrays allocated: 3
[0] after allocation on rank: 2, device arrays allocated: 3
[2] after allocation on rank: 2, device arrays allocated: 3
[1] after allocation on rank: 2, device arrays allocated: 4
Test Passed
Test Passed
Test Passed
[BWA:11411] 2 more processes have sent help message help-btl-vader.txt / cma-permission-denied
[BWA:11411] Set MCA parameter “orte_base_help_aggregate” to 0 to see all help / error messages

The output shows that 4 arrays have been allocated, which should not be possible with only 3 ranks. The error gets worse as I increase the number of processors. Here is mpiDevices.cuf:

! 
!     Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
!
! NVIDIA CORPORATION and its licensors retain all intellectual property
! and proprietary rights in and to this software, related documentation
! and any modifications thereto.
!
!
!    These example codes are a portion of the code samples from the companion
!    website to the book "CUDA Fortran for Scientists and Engineers":
!
! http://store.elsevier.com/product.jsp?isbn=9780124169708
!

program mpiDevices
  use cudafor
  use mpi
  implicit none

  ! global array size
  integer, parameter :: n = 1024*1024
  ! MPI  variables
  integer :: myrank, nprocs, ierr
  ! device 
  type(cudaDeviceProp) :: prop
  integer(int_ptr_kind()) :: freeB, totalB, freeA, totalA 
  real, device, allocatable :: d(:)
  integer :: i, j, istat

  ! MPI initialization
  call MPI_init(ierr)
  call MPI_comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_comm_size(MPI_COMM_WORLD, nProcs, ierr)

  ! print compute mode for device
  istat = cudaGetDevice(j)
  istat = cudaGetDeviceProperties(prop, j)
  do i = 0, nprocs-1
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
     if (myrank == i) write(*,"('[',i0,'] using device: ', &
          i0, ' in compute mode: ', i0)") &
          myrank, j, prop%computeMode
  enddo

  ! get memory use before large allocations, 
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  istat = cudaMemGetInfo(freeB, totalB)

  ! now allocate arrays, one rank at a time
  do j = 0, nProcs-1

     ! allocate on device associated with rank j
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
     if (myrank == j) allocate(d(n)) 
     
     ! Get free memory after allocation
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
     istat = cudaMemGetInfo(freeA, totalA)
     
     write(*,"('  [',i0,'] after allocation on rank: ', i0, &
          ', device arrays allocated: ', i0)") &
          myrank, j, (freeB-freeA)/n/4
    
  end do

  deallocate(d)

  if (istat .ne. 0) then
     write(*,*) "Test Failed"
  else
     write(*,*) "Test Passed"
  endif
 
  call MPI_Finalize(ierr)
end program mpiDevices

Are my problems associated with the Linux operating system or with the PGI compiler?

I can fix the warning message by altering the setting for ptrace_scope using the following command:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

However, mpiDevices.out continues to report too many allocated arrays for np > 2. Are there other settings that I should change?

Thank you, Doug.

Hi DougD,

The CMA error looks to be an MPI or OS issue so I’m unable to help there.

The free-memory issue, though, is because you're not accounting for the CUDA context. The context takes up device memory and is not created until the first use of the device. Hence, when you record the initial free memory, not all of the contexts have been created yet. The extra allocation you're seeing is the memory the contexts themselves are using.

To fix it, you need to add some code that creates the context before you record the initial free memory. The easiest way to do this is to call cudaFree(d).

  ! print compute mode for device
  istat = cudaGetDevice(j)
  istat = cudaGetDeviceProperties(prop, j)
  istat = cudaFree(d)
  do i = 0, nprocs-1



% mpirun -np 3 a.out
[0] using device: 0 in compute mode: 0
[1] using device: 0 in compute mode: 0
[2] using device: 0 in compute mode: 0
  [2] after allocation on rank: 0, device arrays allocated: 1
  [1] after allocation on rank: 0, device arrays allocated: 1
  [0] after allocation on rank: 0, device arrays allocated: 1
  [2] after allocation on rank: 1, device arrays allocated: 2
  [0] after allocation on rank: 1, device arrays allocated: 2
  [1] after allocation on rank: 1, device arrays allocated: 2
  [0] after allocation on rank: 2, device arrays allocated: 3
  [1] after allocation on rank: 2, device arrays allocated: 3
  [2] after allocation on rank: 2, device arrays allocated: 3
 Test Passed
 Test Passed
 Test Passed

Hope this helps,
Mat

Hi Mat,

Thanks for your fix. I implemented it so that the CUDA context is created before the arrays are allocated. I also added some code to print the total free memory available before the arrays are allocated. The initial free memory drops dramatically as I increase the number of MPI ranks. Here is a listing as a function of the number of ranks:

  • #MPI Threads, Initial Free Memory (GB)
    10, 27.04
    20, 22.74
    30, 18.44
    40, 14.14
    50, 9.85
    60, 5.55

The code crashes when I try to run 62 ranks.

I am porting an MPI Fortran CFD code to run on the GPU. The code currently uses 64 ranks, and since the CUDA context takes up so much memory, it does not look as though I will be able to send 64 kernels to the GPU. I will have to use far fewer ranks, which will require reworking the core solver. My concern is that some portions of the solver will not execute well on the GPU. For those portions, I had planned to execute on the host with many ranks using a subdomain approach; now, using many ranks does not look like an option due to memory constraints. Is there a way around this problem? On the other hand, having fewer ranks means larger memory transfers between the host and the device, which should be more efficient.

Thank you, Doug.

Hi DougD,

The CUDA context takes up a lot of memory.

The context is approximately 420MB.
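
As a rough check against the table you posted, the slope of your free-memory numbers gives the per-rank context cost (a sketch based only on your figures, not an exact measurement):

```shell
# Per-rank context cost implied by the table: (27.04 - 5.55) GB spread over (60 - 10) ranks
awk 'BEGIN { printf "context per rank: %.2f GB\n", (27.04 - 5.55) / (60 - 10) }'
```

That works out to roughly 0.43 GB per rank, consistent with the ~420MB context size.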

Is there a way around this problem?

First, you’ll want to use MPS (the CUDA Multi-Process Service) to manage the multiple contexts. The caveat is that MPS supports a maximum of 48 client contexts, so you’ll only be able to run 48 ranks.

You can then set the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to a small value, such as 20. This will have the side effect of reducing the context size.

See: https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_5
and
https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_5_1
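
As a sketch of the setup (the daemon invocation and environment variables are the standard ones from the MPS documentation; adjust the device index and rank count for your system):

```shell
# Start the MPS control daemon on the node that owns the GPU
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Limit each client context to ~20% of the SMs, which also shrinks the context size
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20

# Launch the MPI job as usual; the ranks now share the GPU through MPS
mpiexec -np 48 ./mpiDevices.out

# Shut the daemon down when finished
echo quit | nvidia-cuda-mps-control
```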


On the other hand, having fewer threads means that I will have larger memory transfers between the host and the device, which will be more efficient.

Not necessarily. Data transfers over PCIe are serialized so whether the transfer is done from one rank or 48 ranks, the total data transfer time will most likely be about the same.

Plus you’ll have more contention on the GPU as more ranks are added.

Most likely you’ll be better off reducing the number of ranks to 2 or 4 per GPU. As one rank is doing computation, another can be transferring data.

Hope this helps,
Mat

Hi Mat,

Thank you. Your suggestion will allow me to work with more ranks on the host. I want to be able to perform large simulations on my personal workstation. At present, I have one GV100. I have room for one more GV100 if the code shows promise. The simulation that I want to do of wind-driven breaking waves will take 24 days with 64 processors. I am hopeful that the GPU(s) will speed things up. If I am able to perform these large simulations, I may be able to get a grant on a supercomputer that has one GPU for every one to two ranks as you suggest.

Thank you, Doug.

For me, running this command solved the warning:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
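
To make the change persist across reboots, one option (assuming Ubuntu's standard sysctl layout) is to put the setting in a sysctl configuration file:

```shell
# Relax Yama's ptrace restriction permanently so CMA is available
echo "kernel.yama.ptrace_scope = 0" | sudo tee /etc/sysctl.d/10-ptrace.conf
sudo sysctl --system   # reload sysctl settings now

# Alternatively, disable the CMA single-copy mechanism in Open MPI for a single run
# (this is the MCA variable named in the warning itself):
# mpiexec --mca btl_vader_single_copy_mechanism none -np 2 ./mpiDevices.out
```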