OpenACC Multi-GPU Memory Information

Hello everyone,

I’m running this simple Fortran 90 + MPI test code (compiled with mpif90 from nvhpc-23.1) on 2 CPUs and 2 GPUs, just to check the memory allocation on both GPUs. The code is the following:

program multigpu
use ISO_FORTRAN_ENV, only : INT32
use iso_c_binding, only : c_size_t   ! for integer(c_size_t) below
use mpi
use openacc

implicit none
integer(kind=INT32), allocatable, dimension(:,:,:) :: a
integer :: comm_size, LOCAL_COMM, my_rank, code, i, j, k
integer :: ni, nj, nk
integer :: num_gpus, my_gpu
integer(kind=acc_device_kind) :: device_type

integer(c_size_t) :: free_mem, total_mem
!$acc declare create(free_mem, total_mem)

! MPI stuff
call MPI_Init(code)
call MPI_comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, LOCAL_COMM, code)
call MPI_Comm_size(MPI_COMM_WORLD, comm_size, code)
call MPI_Comm_rank(LOCAL_COMM, my_rank, code)

! OpenACC stuff
if (my_rank == 0) print*, "Using Multi-GPU OpenACC"

ni = 250
nj = 250
nk = 50
if (my_rank == 0) then
        allocate(a(ni,nj,nk))
else
        ni = 900-ni
        nj = 900-nj
        nk = 100-nk
        allocate(a(ni,nj,nk))
endif
!$acc enter data create(a(:,:,:))


print*, "My Rank =",my_rank,"Allocated a(:,:,:)=",size(a)
a = 0.D0
!$acc update device(a(:,:,:))

device_type = acc_get_device_type()
num_gpus = acc_get_num_devices(device_type)
call acc_set_device_num(my_rank, device_type)
my_gpu = acc_get_device_num(device_type)

total_mem = acc_get_property( my_gpu, device_type, acc_property_memory)
free_mem = acc_get_property( my_gpu, device_type, acc_property_free_memory)
print *, "Free Mem: ", free_mem/1e+09,"GB"
print *, "Total Mem: ", total_mem/1e+09,"GB"

!$acc parallel loop collapse(3) 
do k = 1, nk
 do j = 1, nj
  do i = 1, ni
   a(i,j,k) = 5
   if (i==1.and.j==1.and.k==1) then
    print*, my_gpu, size(a)
   endif
  enddo
 enddo
enddo
!$acc update self(a(:,:,:))


write(0,"(a13,i2,a17,i2,a8,i2,a10,i2)") "Here is rank ",my_rank,": I am using GPU",my_gpu, &
                                        " of type ",device_type,". a(42) = ",a(42)

!$acc exit data delete(a)
deallocate(a)

total_mem = acc_get_property( my_gpu, device_type, acc_property_memory)
free_mem = acc_get_property( my_gpu, device_type, acc_property_free_memory)
print *, "Free Mem: ", free_mem/1e+09,"GB"
print *, "Total Mem: ", total_mem/1e+09,"GB"
print *, "Occupied: ", (total_mem-free_mem)/1e+09,"GB"

call MPI_Finalize(code)
print*, "The End..."

end program multigpu

However, when I print the memory after allocation and after deallocation, I don’t see a big difference. I’ve also tried changing the dimensions of the array a. Moreover, even when the dimensions of a are the same for both CPUs (and therefore both GPUs), the two devices report different amounts of memory. In particular, this is what I get:

 Using Multi-GPU OpenACC
 My Rank =            1 Allocated a(:,:,:)=     10125000
 My Rank =            0 Allocated a(:,:,:)=     10125000
 Free Mem:     40.97245184000000      GB
 Total Mem:     42.29883494400000      GB
 Occupied:     1.326383104000000      GB
            0     10125000
Here is rank  0 : I am using GPU 0 of type 4. a(42) =  5
 Free Mem:     40.82774835200000      GB
 Total Mem:     42.29883494400000      GB
 Occupied:     1.471086592000000      GB
 Free Mem:     37.03190323200000      GB
 Total Mem:     42.29883494400000      GB
 Occupied:     5.266931712000000      GB
            1     10125000
Here is rank  1 : I am using GPU 1 of type 4. a(42) =  5
 Free Mem:     36.88719974400000      GB
 Total Mem:     42.29883494400000      GB
 Occupied:     5.411635200000000      GB
 The End...
 The End...

As you can see, the allocation of a is the same for both CPUs and GPUs. It can also be noticed that the occupied memory after deallocation is greater than before allocation, which should be impossible! Surely something is going wrong in the implementation. Am I missing something?

Thank you all

Hi -

So, I have a couple of problems with your code. First, you’re allocating and creating the a array on the GPU before you’ve selected the OpenACC device it should live on; just running your code, I got present clause errors because the data hadn’t been created on the second GPU. Second, this code should definitely crash as written, because a(42) isn’t a valid reference: the a array is 3-dimensional, so indexing it with a single subscript isn’t correct. Lastly, just for ease of reading, I’d label each line of output with the MPI rank it comes from and mark the prints clearly as "BEFORE" and "AFTER"; I’d also add an "Occupied" print before the kernel for parity.
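The essential fix, as a minimal sketch (same variable names as your code, excerpted from the full listing below):

device_type = acc_get_device_type()
call acc_set_device_num(my_rank, device_type)   ! pick this rank's GPU first...
my_gpu = acc_get_device_num(device_type)

allocate(a(ni,nj,nk))
!$acc enter data create(a)                      ! ...so a is created on that GPU

! a is 3-dimensional, so index it with three subscripts:
print *, a(42,42,42)                            ! not a(42)

Putting all of this together, I would suggest the code look like this: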

program multigpu
use ISO_FORTRAN_ENV, only : INT32
use iso_c_binding, only : c_size_t   ! for integer(c_size_t) below
use mpi
use openacc

implicit none
integer(kind=INT32), allocatable, dimension(:,:,:) :: a
integer :: comm_size, LOCAL_COMM, my_rank, code, i, j, k
integer :: ni, nj, nk
integer :: num_gpus, my_gpu
integer(kind=acc_device_kind) :: device_type

integer(c_size_t) :: free_mem, total_mem
!$acc declare create(free_mem, total_mem)

! MPI stuff
call MPI_Init(code)
call MPI_comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, LOCAL_COMM, code)
call MPI_Comm_size(MPI_COMM_WORLD, comm_size, code)
call MPI_Comm_rank(LOCAL_COMM, my_rank, code)

device_type = acc_get_device_type()
num_gpus = acc_get_num_devices(device_type)
call acc_set_device_num(my_rank, device_type)
my_gpu = acc_get_device_num(device_type)

! OpenACC stuff
if (my_rank == 0) print*, "Using Multi-GPU OpenACC"

ni = 250
nj = 250
nk = 50
if (my_rank == 0) then
        allocate(a(ni,nj,nk))
else
        ni = 900-ni
        nj = 900-nj
        nk = 100-nk
        allocate(a(ni,nj,nk))
endif
!$acc enter data create(a(:,:,:))

print*, "My Rank =",my_rank,"Allocated a(:,:,:)=",size(a)
a = 0.D0
!$acc update device(a(:,:,:))

total_mem = acc_get_property( my_gpu, device_type, acc_property_memory)
free_mem = acc_get_property( my_gpu, device_type, acc_property_free_memory)
print *, my_rank, "Free Mem: ", free_mem/1e+09,“GB BEFORE”
print *, my_rank, "Total Mem: ", total_mem/1e+09,“GB BEFORE”
print *, my_rank, "Occupied: ", (total_mem-free_mem)/1e+09,“GB BEFORE”

!$acc parallel loop collapse(3)
do k = 1, nk
 do j = 1, nj
  do i = 1, ni
   a(i,j,k) = 5
   if (i==1.and.j==1.and.k==1) then
    print*, my_rank, my_gpu, size(a)
   endif
  enddo
 enddo
enddo
!$acc update host(a(:,:,:))

write(0,"(a13,i2,a17,i2,a8,i2,a10,i2)") "Here is rank ",my_rank,": I am using GPU",my_gpu, &
                                        " of type ",device_type,". a(42) = ",a(42,42,42)

!$acc exit data delete(a)
deallocate(a)

total_mem = acc_get_property( my_gpu, device_type, acc_property_memory)
free_mem = acc_get_property( my_gpu, device_type, acc_property_free_memory)
print *, my_rank, "Free Mem: ", free_mem/1e+09,“GB AFTER”
print *, my_rank, "Total Mem: ", total_mem/1e+09,“GB AFTER”
print *, my_rank, "Occupied: ", (total_mem-free_mem)/1e+09,“GB AFTER”

call MPI_Finalize(code)
print*, my_rank,"The End..."

end program multigpu

Using this as my program test.f90, I compile it with "mpif90 -O3 -acc test.f90 -o test" using nvhpc 23.1 and an OpenMPI 4 implementation built with nvhpc 23.1. Running this on an Intel processor with Volta GPUs as "mpirun -n 2 ./test", I get the following output:

Using Multi-GPU OpenACC
My Rank =            0 Allocated a(:,:,:)=      3125000
         0 Free Mem:     16.06051     GB BEFORE
         0 Total Mem:     16.94551     GB BEFORE
         0 Occupied:    0.8849981     GB BEFORE
         0            0      3125000
Here is rank  0 : I am using GPU 0 of type 4. a(42) =  5
         0 Free Mem:     16.06253     GB AFTER
         0 Total Mem:     16.94551     GB AFTER
         0 Occupied:    0.8829834     GB AFTER
My Rank =            1 Allocated a(:,:,:)=     21125000
         1 Free Mem:     16.44115     GB BEFORE
         1 Total Mem:     16.94551     GB BEFORE
         1 Occupied:    0.5043650     GB BEFORE
         1            1     21125000
Here is rank  1 : I am using GPU 1 of type 4. a(42) =  5
         1 Free Mem:     16.51516     GB AFTER
         1 Total Mem:     16.94551     GB AFTER
         1 Occupied:    0.4303503     GB AFTER
         0 The End...
         1 The End...

So for this case I see that both GPUs start with some memory occupied, and that after I free memory, less memory is occupied. This doesn’t surprise me. If I alter your code so that both ranks get the same allocation, effectively changing the "else" block of the ni/nj/nk setup to:

    ni = 500-ni
    nj = 500-nj
    nk = 100-nk

In this case, I get the following output:

Using Multi-GPU OpenACC
My Rank =            0 Allocated a(:,:,:)=      3125000
        0 Free Mem:     16.06051     GB BEFORE
        0 Total Mem:     16.94551     GB BEFORE
        0 Occupied:    0.8849981     GB BEFORE
        0            0      3125000
Here is rank  0 : I am using GPU 0 of type 4. a(42) =  5
        0 Free Mem:     16.06253     GB AFTER
        0 Total Mem:     16.94551     GB AFTER
        0 Occupied:    0.8829834     GB AFTER
My Rank =            1 Allocated a(:,:,:)=      3125000
        1 Free Mem:     16.51455     GB BEFORE
        1 Total Mem:     16.94551     GB BEFORE
        1 Occupied:    0.4309647     GB BEFORE
        1            1      3125000
Here is rank  1 : I am using GPU 1 of type 4. a(42) =  5
        1 Free Mem:     16.51656     GB AFTER
        1 Total Mem:     16.94551     GB AFTER
        1 Occupied:    0.4289500     GB AFTER
        1 The End...
        0 The End...

So in both cases the GPUs start with slightly different amounts of memory free and occupied, but they return to essentially the same state - within a fraction of a GB, which may just be extra memory overhead accumulated during the run. Can you check how the new code behaves in your situation?

Also, if you still see the problem, please share what type of system you’re running on (CPU and GPU) and what compiler flags you’re passing when you compile. That will help me recreate the situation. If the problem persists for you but isn’t reproducible on our side, it could be related to the system hardware or to something particular about the MPI implementation you’re using.

Yes, you’re completely right. I corrected all the mistakes you found - thanks for that. I’m running on an HPC prototype cluster with 2 NVIDIA A100-PCIE-40GB GPUs.

When compiling in this way:

mpif90 -r8 -acc=gpu,noautopar -target=gpu -gpu=cc80,managed -Mpreprocess -Mfree -Mextend -Munixlogical -Mbyteswapio -traceback -Mchkstk -Mnostack_arrays -Mnofprelaxed -Mnofpapprox -Minfo=accel -o multigpu multigpu.f90

I obtain this strange result:

 Using Multi-GPU OpenACC
 My Rank =            0 Allocated a(:,:,:)=      3125000
            0 Free Mem:     40.94558208000000      GB BEFORE
            0 Total Mem:     42.29883494400000      GB BEFORE
            0 Occupied:     1.353252864000000      GB BEFORE
            0            0      3125000
Here is rank  0 : I am using GPU 0 of type 4. a(42) =  5
            0 Free Mem:     40.93509632000000      GB AFTER
            0 Total Mem:     42.29883494400000      GB AFTER
            0 Occupied:     1.363738624000000      GB AFTER
 My Rank =            1 Allocated a(:,:,:)=      3125000
            1 Free Mem:     36.99205734400000      GB BEFORE
            1 Total Mem:     42.29883494400000      GB BEFORE
            1 Occupied:     5.306777600000000      GB BEFORE
            1            1      3125000
Here is rank  1 : I am using GPU 1 of type 4. a(42) =  5
            1 Free Mem:     36.98157158400000      GB AFTER
            1 Total Mem:     42.29883494400000      GB AFTER
            1 Occupied:     5.317263360000000      GB AFTER
            0 The End...
            1 The End...

Basically, after deallocation the occupied memory is greater than before - not possible, from my point of view! However, if I compile with the simple:

mpif90 -O3 -acc multigpu.f90 -o multigpu

I obtain:

 Using Multi-GPU OpenACC
 My Rank =            0 Allocated a(:,:,:)=      3125000
            0 Free Mem:     40.94329     GB BEFORE
            0 Total Mem:     42.29884     GB BEFORE
            0 Occupied:     1.355547     GB BEFORE
            0            0      3125000
Here is rank  0 : I am using GPU 0 of type 4. a(42) =  5
            0 Free Mem:     40.94530     GB AFTER
            0 Total Mem:     42.29884     GB AFTER
            0 Occupied:     1.353532     GB AFTER
 My Rank =            1 Allocated a(:,:,:)=      3125000
            1 Free Mem:     36.99206     GB BEFORE
            1 Total Mem:     42.29884     GB BEFORE
            1 Occupied:     5.306777     GB BEFORE
            1            1      3125000
Here is rank  1 : I am using GPU 1 of type 4. a(42) =  5
            1 Free Mem:     36.99407     GB AFTER
            1 Total Mem:     42.29884     GB AFTER
            1 Occupied:     5.304763     GB AFTER
            0 The End...
            1 The End...

Now these results make sense and everything is OK. Nevertheless, I notice that the imbalance between the two GPUs is larger than in your results, and I really don’t know why.

Why, in your opinion, is the behavior completely different when compiling with all those flags? I ask because the real code I’m trying to port to OpenACC is compiled in this complex way.

Thank you again,
-Matteo

Ah - this is expected. When you turn on -gpu=managed, you’re allocating differently than you think. In this memory mode, allocations go through cudaMallocManaged calls, which are much more expensive than ordinary device allocations. To compensate for that cost, we use a pool allocator: when the runtime sees it doesn’t have enough memory for the data you want, it allocates one big block all at once, and when you deallocate that data it doesn’t actually free the memory - it just marks it as unassigned. That lets the runtime reuse the block for new data very quickly, without paying the high cost of allocating managed memory again.
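If you want to confirm this is what’s happening, the pool allocator can be tuned or disabled through environment variables - the names below are from memory for recent HPC SDK releases, so double-check them in the documentation for your version:

NVCOMPILER_ACC_POOL_ALLOC=0 mpirun -n 2 ./multigpu     (disable the pool allocator entirely)
NVCOMPILER_ACC_POOL_SIZE=1GB mpirun -n 2 ./multigpu    (change the initial pool size)

With the pool disabled, the free/occupied numbers you print should follow your allocate/deallocate calls much more closely, at the cost of slower managed allocations.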

This should explain what you’re seeing. It’s worth digging into what managed memory does under the hood, and if you run your code through Nsight Systems with memory tracing enabled, you should be able to see the difference in behavior there as well.
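For example, something along these lines should work (exact option names can vary between Nsight Systems versions, so treat this as a starting point):

nsys profile --trace=cuda,openacc,mpi --cuda-memory-usage=true mpirun -n 2 ./multigpu

The memory-usage track on the timeline should show one big managed allocation that stays resident across your deallocate, rather than a separate allocation/free pair per array.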

OK, now I understand - the explanation makes sense. Thank you so much for the detailed explanation; I will definitely try Nsight to get a better view of what happens.

-Matteo

Glad I could help! Please let us know if you run into any other issues.
