Query device resources usage during runtime

Hi all,

While doing the bookkeeping of device data in my OpenACC Fortran code, I was wondering if there is a way to get the total storage of all allocated device data. From NVIDIA’s extensions to the OpenACC runtime API, I was able to use acc_get_memory and acc_get_free_memory to get the total device resource usage, and acc_bytesalloc to get the “total bytes allocated by data or compute regions”. However, I am missing a way to get the contribution of OpenACC arrays that are mapped to CUDA device arrays. These are listed when I use acc_present_dump, but I have not found a neat way to sum up their memory footprint.

I am interested in getting this sum because, from this post, I gather that acc_get_memory and acc_get_free_memory may report unexpected values unless I disable the runtime memory manager.

Here is a small example illustrating what I am looking for:

% cat test.f90
program p
  use accel_lib
  real, device, allocatable, dimension(:,:,:) :: a_d
  real,         allocatable, dimension(:,:,:) :: a_h
  allocate(a_d(10,10,10),a_h(10,10,10))
  call acc_map_data(a_h,a_d,sizeof(a_h))
  call acc_present_dump   ! will report the mapped device array
  print*,acc_bytesalloc() ! will print zero
end program p
% nvfortran -acc -cuda test.f90 && ./a.out
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 6.1, threadid=1
host:0x39b8820 device:0x7f41c7e00000 size:4000 presentcount:0+1 line:-1 name:(null)
                        0

Thank you!

Hi p-costa,

You may need to use the standard OpenACC “acc_get_property” API instead. “acc_bytesalloc” is only going to show the allocations made by the OpenACC runtime, not CUDA.

Note that CUDA Fortran has a similar optimization where it doesn’t free device memory by default and instead will try to reuse this memory. I’m not sure if we have a way to disable this as we do in OpenACC, but I have asked some folks who should know.

Note that under the hood, “acc_get_property” uses cudaMemGetInfo.

Here’s an example:

% cat test.f90
program p
  use accel_lib
  use openacc
  use cudafor
  real, device, allocatable, dimension(:,:,:) :: a_d
  real,         allocatable, dimension(:,:,:) :: a_h
  integer(c_size_t) :: free_mem_start
  integer(c_size_t) :: free_mem
  integer :: devNum
  integer(acc_device_kind) :: devType

  devType = acc_get_device_type()
  devNum=acc_get_device_num(devType)
  print*,"Using device: ", devNum
  free_mem_start = acc_get_property(devNum,devType,acc_property_free_memory)

  print *, "Start Free Mem: ", free_mem_start
  allocate(a_d(10,10,10),a_h(10,10,10))
  call acc_map_data(a_h,a_d,sizeof(a_h))
  call acc_present_dump   ! will report the mapped device array
  print*,acc_bytesalloc() ! will print zero
  free_mem = acc_get_property(devNum,devType,acc_property_free_memory)
  print *, "Allocated: ", free_mem_start-free_mem
  call acc_unmap_data(a_h)
  deallocate(a_d)
  deallocate(a_h)
  free_mem = acc_get_property(devNum,devType,acc_property_free_memory)
  print *, "After deallocate: ", free_mem_start-free_mem


end program p
% nvfortran -acc -cuda test.f90; a.out
 Using device:             0
 Start Free Mem:               84761837568
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.0, threadid=1
host:0x30c1de0 device:0x1501432fa000 size:4000 presentcount:0+1 line:-1 name:(null)
                        0
 Allocated:                  10485760
 After deallocate:                  10485760

Hope this helps,
Mat


Thank you very much, Mat. As always, this is very helpful.

So I’d guess that PGI/NVIDIA’s acc_get_memory and acc_get_free_memory may be wrappers for acc_get_property, doing the equivalent of what you wrote above…

I see that, indeed, free_mem_start-free_mem is quite different from the 10*10*10*4 = 4000 bytes reported by acc_present_dump; presumably it is more efficient to allocate a larger chunk of memory in one go. I get something similar, but smaller, on my (older and cheaper) card.

So I guess there is no routine that will return the sum of all allocations reported by acc_present_dump?

Thank you again,
Pedro

While I’m not 100% sure, I think what’s happening is that the CUDA runtime is allocating a page at a time, which accounts for the extra memory.
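For concreteness, the gap between the two numbers in the transcript above can be checked with a quick back-of-the-envelope calculation (plain Python here, just for the arithmetic; the 10485760 figure is the free-memory drop from the run above):

```python
# Bytes the OpenACC present table reports for the 10x10x10 single-precision array
requested = 10 * 10 * 10 * 4
print(requested)  # 4000

# Drop in free device memory reported by acc_get_property in the run above
observed = 10485760
print(observed // 2**20)  # 10 MiB: far more than the 4000 bytes actually requested
```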

So I guess there is no routine that will return the sum of all allocations reported by acc_present_dump?

This is what acc_bytesalloc is for, but it covers OpenACC runtime allocations, not CUDA allocations. I can submit an RFE, but first, what is the use case here? Are you just trying to account for how much device memory the program allocates, or are you going to use this information to make decisions in the program (such as a blocking algorithm)?

Thanks,
Mat

Thank you, Mat.

Are you just trying to account for how much device memory the program allocates

Yes, this functionality is not critical for my application. I was just looking for a way to validate my estimate of the device memory footprint, to see how far I can push my computational setups so that they just fit in the GPU memory of the supercomputer nodes. I thought that, since acc_present_dump reports both (mapped) CUDA and OpenACC allocations, there would perhaps be a subroutine that reports the whole memory footprint, similar to acc_bytesalloc.
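In case it helps to illustrate: the kind of by-hand estimator I mean is nothing more than summing elements times element size over the device-resident arrays (a minimal Python sketch; the array list and the 4-byte real kind are assumptions matching the example above):

```python
from math import prod

def device_bytes(shape, bytes_per_element=4):
    """Estimated device footprint of one array: element count times element size."""
    return prod(shape) * bytes_per_element

# Sum over all device-resident arrays; here just the (10,10,10) real array
# from the example in this thread:
arrays = [(10, 10, 10)]
total = sum(device_bytes(s) for s in arrays)
print(total)  # 4000, matching the size reported by acc_present_dump
```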

Pedro

I went ahead and added an RFE, TPR #32360, requesting that acc_bytesalloc return the allocation size as seen in the present table rather than what’s recorded by the OpenACC runtime allocator.

I did give it a low priority, but hopefully it’s something easy that an engineer with a few spare cycles can implement. No guarantees, though.
