I am running mpiDevices.cuf from the book CUDA Fortran for Scientists and Engineers. The machine has 32 Intel processors and an NVIDIA GV100, the operating system is Ubuntu, and I am using PGI Community Edition 18.10.
When I run the code with two MPI processes, I get a warning and the following output:
mpiexec -np 2 mpiDevices.out
WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: BWA
[0] using device: 0 in compute mode: 0
[1] using device: 0 in compute mode: 0
[1] after allocation on rank: 0, device arrays allocated: 1
[0] after allocation on rank: 0, device arrays allocated: 1
[1] after allocation on rank: 1, device arrays allocated: 2
[0] after allocation on rank: 1, device arrays allocated: 2
Test Passed
Test Passed
[BWA:11385] 1 more process has sent help message help-btl-vader.txt / cma-permission-denied
[BWA:11385] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
The output is correct, but I get a warning. What does the warning about CMA support mean? When I run the code with three MPI processes, the output is incorrect and the warning is still there:
mpiexec -np 3 mpiDevices.out
WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: BWA
[0] using device: 0 in compute mode: 0
[1] using device: 0 in compute mode: 0
[2] using device: 0 in compute mode: 0
[0] after allocation on rank: 0, device arrays allocated: 1
[2] after allocation on rank: 0, device arrays allocated: 1
[1] after allocation on rank: 0, device arrays allocated: 2
[2] after allocation on rank: 1, device arrays allocated: 2
[0] after allocation on rank: 1, device arrays allocated: 2
[1] after allocation on rank: 1, device arrays allocated: 3
[0] after allocation on rank: 2, device arrays allocated: 3
[2] after allocation on rank: 2, device arrays allocated: 3
[1] after allocation on rank: 2, device arrays allocated: 4
Test Passed
Test Passed
Test Passed
[BWA:11411] 2 more processes have sent help message help-btl-vader.txt / cma-permission-denied
[BWA:11411] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
The output reports that 4 device arrays have been allocated, which should not be possible: each rank allocates only one array, so three ranks should show at most 3. The discrepancy gets worse as I increase the number of processes. Here is mpiDevices.cuf (a short sketch of the counting arithmetic follows the listing):
!
! Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
!
! NVIDIA CORPORATION and its licensors retain all intellectual property
! and proprietary rights in and to this software, related documentation
! and any modifications thereto.
!
!
! These example codes are a portion of the code samples from the companion
! website to the book "CUDA Fortran for Scientists and Engineers":
!
! http://store.elsevier.com/product.jsp?isbn=9780124169708
!
program mpiDevices
  use cudafor
  use mpi
  implicit none

  ! global array size
  integer, parameter :: n = 1024*1024
  ! MPI variables
  integer :: myrank, nprocs, ierr
  ! device
  type(cudaDeviceProp) :: prop
  integer(int_ptr_kind()) :: freeB, totalB, freeA, totalA
  real, device, allocatable :: d(:)
  integer :: i, j, istat

  ! MPI initialization
  call MPI_init(ierr)
  call MPI_comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_comm_size(MPI_COMM_WORLD, nProcs, ierr)

  ! print compute mode for device
  istat = cudaGetDevice(j)
  istat = cudaGetDeviceProperties(prop, j)
  do i = 0, nprocs-1
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
     if (myrank == i) write(*,"('[',i0,'] using device: ', &
          &i0, ' in compute mode: ', i0)") &
          myrank, j, prop%computeMode
  enddo

  ! get memory use before large allocations
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  istat = cudaMemGetInfo(freeB, totalB)

  ! now allocate arrays, one rank at a time
  do j = 0, nProcs-1
     ! allocate on device associated with rank j
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
     if (myrank == j) allocate(d(n))
     ! get free memory after allocation
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
     istat = cudaMemGetInfo(freeA, totalA)
     write(*,"(' [',i0,'] after allocation on rank: ', i0, &
          &', device arrays allocated: ', i0)") &
          myrank, j, (freeB-freeA)/n/4
  end do

  deallocate(d)

  if (istat .ne. 0) then
     write(*,*) "Test Failed"
  else
     write(*,*) "Test Passed"
  endif

  call MPI_Finalize(ierr)
end program mpiDevices
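To make the arithmetic explicit, here is a minimal single-process sketch (no MPI) of how the listing above infers the array count: it divides the drop in free device memory reported by cudaMemGetInfo by the nominal array size of n*4 bytes, so the inferred count would overshoot whenever an allocation reduces free memory by more than exactly n*4 bytes. The program name memInfoSketch is mine and is not part of the book's samples.

program memInfoSketch
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024
  integer(int_ptr_kind()) :: freeB, totalB, freeA, totalA
  real, device, allocatable :: d(:)
  integer :: istat

  ! free device memory before the allocation
  istat = cudaMemGetInfo(freeB, totalB)

  ! one n-element single-precision device array, nominally n*4 bytes
  allocate(d(n))

  ! free device memory after the allocation
  istat = cudaMemGetInfo(freeA, totalA)

  write(*,"('nominal array size (bytes):  ', i0)") n*4
  write(*,"('drop in free memory (bytes): ', i0)") freeB - freeA
  write(*,"('inferred array count:        ', i0)") (freeB - freeA)/n/4

  deallocate(d)
end program memInfoSketch

If the drop in free memory printed here is larger than n*4, the same formula in mpiDevices.cuf would overstate the number of arrays, which looks like what I am seeing.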
Are my problems associated with the Linux operating system or with the PGI compiler?
I can fix the warning message by altering the setting for ptrace_scope using the following command:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
However, mpiDevices.out continues to report too many allocated arrays for np > 2. Are there other settings that I should change?
Thank you, Doug.