Hello everyone,
I’m running the simple test program below, written in Fortran 90 with MPI and OpenACC (built with mpif90 from nvhpc-23.1). I launch it with 2 MPI ranks and 2 GPUs, just to check the memory allocation on each GPU. The code is the following:
program multigpu
   use ISO_FORTRAN_ENV, only : INT32
   use ISO_C_BINDING, only : c_size_t
   use mpi
   use openacc
   implicit none
   integer(kind=INT32), allocatable, dimension(:,:,:) :: a
   integer :: comm_size, LOCAL_COMM, my_rank, code, i, j, k
   integer :: ni, nj, nk
   integer :: num_gpus, my_gpu
   integer(kind=acc_device_kind) :: device_type
   integer(c_size_t) :: free_mem, total_mem
   !$acc declare create(free_mem, total_mem)
   ! MPI stuff
   call MPI_Init(code)
   call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, LOCAL_COMM, code)
   call MPI_Comm_size(MPI_COMM_WORLD, comm_size, code)
   call MPI_Comm_rank(LOCAL_COMM, my_rank, code)
   ! OpenACC stuff
   if (my_rank == 0) print *, "Using Multi-GPU OpenACC"
   ni = 250
   nj = 250
   nk = 50
   if (my_rank == 0) then
      allocate(a(ni,nj,nk))
   else
      ni = 900 - ni
      nj = 900 - nj
      nk = 100 - nk
      allocate(a(ni,nj,nk))
   endif
   !$acc enter data create(a(:,:,:))
   print *, "My Rank =", my_rank, "Allocated a(:,:,:)=", size(a)
   a = 0
   !$acc update device(a(:,:,:))
   device_type = acc_get_device_type()
   num_gpus = acc_get_num_devices(device_type)
   call acc_set_device_num(my_rank, device_type)
   my_gpu = acc_get_device_num(device_type)
   total_mem = acc_get_property(my_gpu, device_type, acc_property_memory)
   free_mem = acc_get_property(my_gpu, device_type, acc_property_free_memory)
   print *, "Free Mem: ", free_mem/1e+09, "GB"
   print *, "Total Mem: ", total_mem/1e+09, "GB"
   print *, "Occupied: ", (total_mem-free_mem)/1e+09, "GB"
   !$acc parallel loop collapse(3)
   do k = 1, nk
      do j = 1, nj
         do i = 1, ni
            a(i,j,k) = 5
            if (i==1 .and. j==1 .and. k==1) then
               print *, my_gpu, size(a)
            endif
         enddo
      enddo
   enddo
   !$acc update self(a(:,:,:))
   write(0,"(a13,i2,a17,i2,a8,i2,a10,i2)") "Here is rank ", my_rank, ": I am using GPU", my_gpu, &
        " of type ", device_type, ". a(42) = ", a(42,1,1)
   !$acc exit data delete(a)
   deallocate(a)
   total_mem = acc_get_property(my_gpu, device_type, acc_property_memory)
   free_mem = acc_get_property(my_gpu, device_type, acc_property_free_memory)
   print *, "Free Mem: ", free_mem/1e+09, "GB"
   print *, "Total Mem: ", total_mem/1e+09, "GB"
   print *, "Occupied: ", (total_mem-free_mem)/1e+09, "GB"
   call MPI_Finalize(code)
   print *, "The End..."
end program multigpu
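For completeness, I compile and launch it roughly like this (the exact flags and launcher options are just what I use on my system, so they may differ elsewhere):
mpif90 -acc multigpu.f90 -o multigpu
mpirun -np 2 ./multigpu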
However, when I print the free memory after allocation and after deallocation, I don’t see a big difference. I’ve also tried changing the dimensions of the array a. Moreover, even when a has the same dimensions on both ranks and therefore on both GPUs (e.g. ni = nj = 450 and nk = 50, so that 900-450 = 450 and 100-50 = 50 give rank 1 the same shape), the two devices report different memory usage. In particular, this is what I get:
Using Multi-GPU OpenACC
My Rank = 1 Allocated a(:,:,:)= 10125000
My Rank = 0 Allocated a(:,:,:)= 10125000
Free Mem: 40.97245184000000 GB
Total Mem: 42.29883494400000 GB
Occupied: 1.326383104000000 GB
0 10125000
Here is rank 0 : I am using GPU 0 of type 4. a(42) = 5
Free Mem: 40.82774835200000 GB
Total Mem: 42.29883494400000 GB
Occupied: 1.471086592000000 GB
Free Mem: 37.03190323200000 GB
Total Mem: 42.29883494400000 GB
Occupied: 5.266931712000000 GB
1 10125000
Here is rank 1 : I am using GPU 1 of type 4. a(42) = 5
Free Mem: 36.88719974400000 GB
Total Mem: 42.29883494400000 GB
Occupied: 5.411635200000000 GB
The End...
The End...
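(For scale, if my arithmetic is right: here a holds 10,125,000 INT32 elements, i.e. 10,125,000 × 4 B ≈ 0.04 GB per device, so the array itself is tiny compared to the GB-sized gaps above.)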
As you can see, the size of a is the same on both ranks and both GPUs. It can also be noticed that the occupied memory after deallocation is greater than before allocation, which should be impossible! Surely there is something wrong somewhere in the implementation. Am I missing something?
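In case it helps to pin this down, here is the stripped-down probe I have in mind (just an untested sketch, not something I have verified): select the device and take the baseline free-memory reading before the first data directive, so the baseline is not polluted by whatever the runtime sets up on first contact with a device.

program gpumem_probe
   ! Untested sketch: pick the GPU *before* any OpenACC data directive,
   ! then measure free memory before/after creating and deleting a.
   use ISO_FORTRAN_ENV, only : INT32
   use ISO_C_BINDING, only : c_size_t
   use mpi
   use openacc
   implicit none
   integer(kind=INT32), allocatable :: a(:,:,:)
   integer :: my_rank, code, my_gpu
   integer(kind=acc_device_kind) :: device_type
   integer(c_size_t) :: free0, free1, free2
   call MPI_Init(code)
   call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, code)
   device_type = acc_get_device_type()
   call acc_set_device_num(my_rank, device_type)   ! device chosen first
   my_gpu = acc_get_device_num(device_type)
   free0 = acc_get_property(my_gpu, device_type, acc_property_free_memory)
   allocate(a(450,450,50))                         ! ~0.04 GB of INT32
   !$acc enter data create(a)
   free1 = acc_get_property(my_gpu, device_type, acc_property_free_memory)
   !$acc exit data delete(a)
   deallocate(a)
   free2 = acc_get_property(my_gpu, device_type, acc_property_free_memory)
   print *, my_rank, "alloc cost (GB):", (free0-free1)/1d9, &
            "residual (GB):", (free0-free2)/1d9
   call MPI_Finalize(code)
end program gpumem_probe

If the residual came out near zero with this ordering, that would suggest the extra occupied memory in my original run comes from the runtime attaching to the devices rather than from a itself; but I may be misreading what acc_get_property reports.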
Thank you all