Hello,
I am working on offloading a large MPI/OpenMP Fortran code to GPU using OpenACC. Since the data for the offloaded task is too large to fit on our GPUs as is, I introduced partial, asynchronous data copies to and from the device using pinned memory. Even so, the task requires careful management of GPU memory to fit on the device.
From other posts, such as this or this, I read that by default the OpenACC runtime does not actually free device memory when it encounters a copyout or delete clause, but instead returns that memory to its device memory pool. For context, here is pseudo-code representing the structure of the GPU-enabled part in question in our case:
!$acc enter data copyin(global_data1,global_data2)
<some CPU operations>
do iz = 1, iz_end
   !$acc enter data copyin(p5F_float_holder(:,:,:,:,iz)) async(2)
   !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5F_float_holder(:,:,:,:,iz),global_data1,global_data2) async(1)
   do p1 = 1, p1_end
      do p2 = 1, p2_end
         p3_end = global_data1(p2,p1)
         !$acc loop worker vector private(p5_float) collapse(2)
         do p3 = 1, p3_end
            do p4 = 1, p4_end
               p5_float = p5F_float_holder(p4,p3,p2,p1,iz)
               p5_end = global_data2(p3,p1)
               !$acc loop seq
               do p5 = 1, p5_end
                  <a sequence of numerical operations>
                  <result> = <a sequence of numerical operations>
                  p5_float = p5_float + <result>
               enddo
               p5F_float_holder(p4,p3,p2,p1,iz) = p5_float
            enddo
         enddo
      enddo
   enddo
   !$acc enter data copyin(p5B_float_holder(:,:,:,:,iz)) async(2)
   !$acc wait
   !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5B_float_holder(:,:,:,:,iz),global_data1,global_data2) async(1)
   do p1 = 1, p1_end
      do p2 = 1, p2_end
         p3_end = global_data1(p2,p1)
         !$acc loop worker vector private(p5_float) collapse(2)
         do p3 = 1, p3_end
            do p4 = 1, p4_end
               p5_float = p5B_float_holder(p4,p3,p2,p1,iz)
               p5_end = global_data2(p3,p1)
               !$acc loop seq
               do p5 = p5_end, 1, -1
                  <a sequence of numerical operations>
                  <result> = <a sequence of numerical operations>
                  p5_float = p5_float + <result>
               enddo
               p5B_float_holder(p4,p3,p2,p1,iz) = p5_float
            enddo
         enddo
      enddo
   enddo
   !$acc exit data copyout(p5F_float_holder(:,:,:,:,iz)) async(3) finalize
   !$acc wait(1)
   !$acc parallel loop collapse(4) default(present) async(1)
   do p1 = 1, p1_end
      do p2 = 1, p2_end
         do p6 = 1, p6_end
            do p4 = 1, p4_end
               <a sequence of numerical operations>
            enddo
         enddo
      enddo
   enddo
   !$acc exit data copyout(p5B_float_holder(:,:,:,:,iz)) async(3) finalize
enddo
The code is executed in parallel on several GPUs (and potentially on several nodes); each MPI process works on a decomposed part of the data, which is omitted here. We have two GPU nodes, each equipped with 8 NVIDIA RTX A5000 24 GB cards; the nodes have identical hardware and are connected to the master node via identical InfiniBand links. We are running Rocky Linux 8.7 and HPC SDK 24.5, installed natively via yum. The driver and CUDA versions can be seen in the screenshot below:
As shown in the pseudo-code, the idea is to split the transferred data into chunks via the sequential iz loop, which runs on the host, and to overlap the transfers to and from the device with the kernel executions. To free the memory, we add the finalize clause to all copyouts. The memory occupancy with just the global data resident (before we start the iz loop) is shown in the screenshot above. Once the iz loop runs, the memory occupancy increases by around 6 GB per card, bringing it close to the GPU limit. For some problems this is still enough, but for others we get an "out of memory" error. The increased memory usage persists even after we leave the iz loop and move on to the CPU part of the computation.

The workaround we found is to run with NV_ACC_MEM_MANAGE=0, which we export to the MPI ranks via the launcher's -x NV_ACC_MEM_MANAGE=0 option. With this setting, the memory occupancy stays at the level shown in the screenshot even during the iz loop. The problem is that this works well on one node, let's call it node1, but when we run the same executable on the identical node2, it performs much worse and slows down a lot (the difference can be several times), while the power draw and GPU utilization reported by nvidia-smi drop from their maximum values to about 100 W and 40-50%. Strangely, after we manually reboot node2 it performs just like node1 again, but after several runs it starts showing the symptoms once more. We found that removing NV_ACC_MEM_MANAGE=0 restores normal performance on that node (at the cost of the higher GPU memory occupancy), but this inconsistency in performance is really bothering us, so we would like to dig deeper and find out what is causing it. Once again, we do not experience any slowdown over time on node1, which is identical to node2.
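To quantify how much device memory the runtime is actually holding, we plan to instrument the code along the following lines. This is only a minimal sketch based on the acc_get_property API from OpenACC 2.6 as we understand it from the spec and the NVHPC openacc module; the report_device_memory routine and the unit conversion are ours, and we presume the reported free memory reflects what the CUDA driver sees, so memory kept in the runtime's pool would count as used:

! Minimal sketch: print free/total device memory for the current device.
! Intended to be called before and after the iz loop (and, if needed,
! once per iz iteration) to watch how much memory the runtime retains.
subroutine report_device_memory(tag)
  use openacc
  use iso_c_binding, only: c_size_t
  implicit none
  character(len=*), intent(in) :: tag
  integer :: dev
  integer(c_size_t) :: total_mem, free_mem

  dev       = acc_get_device_num(acc_device_nvidia)
  total_mem = acc_get_property(dev, acc_device_nvidia, acc_property_memory)
  free_mem  = acc_get_property(dev, acc_device_nvidia, acc_property_free_memory)

  write(*,'(a,a,i0,a,i0,a,i0,a)') trim(tag), ': device ', dev, &
       ' free ', free_mem/(1024_c_size_t**2), ' MiB of ', &
       total_mem/(1024_c_size_t**2), ' MiB'
end subroutine report_device_memory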
The compilation flags are: -fast -O3 -mp -cuda -acc -traceback -Minfo=accel -cpp -Mlarge_arrays -Mbackslash -gpu=cc86,deepcopy,cuda12.4,lineinfo,safecache
Our questions are:
- What is the “device memory pool” and how is it different from just free GPU memory?
- Based on the symptoms, it looks like some resource fills up during a fresh run on node2, and once it is full the performance drops. What resource could that be, and how can we free or reset it without rebooting the node?
- What is the expected behavior of the finalize clause in the context of our code and compilation flags? Could there be a conflict with the copyout being asynchronous, or with the use of pinned memory?
- What can we do to investigate this problem further? (One check we are considering is sketched below.)
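For the finalize question, this is the kind of consistency check we have in mind, a minimal sketch only: it assumes use openacc in the enclosing scope and that the Fortran acc_is_present generic interface accepts this array section, and it would sit right after the chunk's copyout has completed:

! Hypothetical check placed after
!    !$acc exit data copyout(p5F_float_holder(:,:,:,:,iz)) async(3) finalize
! With finalize, the reference count should drop to zero, so the section
! should no longer be present on the device once queue 3 has drained.
!$acc wait(3)
if (acc_is_present(p5F_float_holder(:,:,:,:,iz))) then
   write(*,*) 'iz=', iz, ': p5F chunk still present on device after copyout finalize'
end if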
We would appreciate any insights or suggestions on how to solve this problem. Thanks in advance!