I have a CFD solver that is accelerated with OpenACC and parallelized with MPI. Without going into too much detail: on a single GPU I can run about 16 million cells per GPU without any slowdown issues. If I move to multi-GPU, I see a significant slowdown. I thought I had tracked the issue down to allocating too much private memory in a specific subroutine in my code. That fixed it for a while, but now the problem has come back.
I am very confused. If I check nvidia-smi, I see the following:
My application is bin/dew. In the per-process GPU memory usage, it shows only 3034 MB per GPU, but for some reason the total memory usage above is close to full capacity. I am really unsure what is going on.
Any advice appreciated!
Does your program use CUDA Unified Memory, i.e. do you compile with -gpu=managed?
UM won't show up as part of the program's memory usage, but it will show up in the total memory.
Also, UM can oversubscribe GPU memory, so if you use more than what's available, it will get paged back to the host. While convenient, this can cause slowdowns if the memory gets paged back and forth.
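For context, with the NVHPC compilers managed memory is enabled at compile time. A minimal sketch of the two modes, assuming nvfortran and the file/binary names from your post:

```shell
# With -gpu=managed, all allocatables live in CUDA Unified Memory and
# can oversubscribe the GPU (pages migrate to/from the host on demand).
nvfortran -acc -gpu=managed -o bin/dew src/*.f90

# Without it, only data explicitly placed in OpenACC data regions
# resides on the GPU, and nothing is silently paged back to the host.
nvfortran -acc -o bin/dew src/*.f90
```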
Thanks for your response. Yes, I use Unified Memory. I think what you described is happening. So my (easier) options are to decrease the total memory that I allocate globally or to decrease my mesh size? The problem is likely not related to private memory allocation?
So my (easier) options are to decrease the total memory that I allocate globally or to decrease my mesh size?
Difficult for me to say since I don't know your code. Though consider running the code through Nsight Systems to get a profile and better understand how memory movement is affecting your performance.
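For example, something like the following (the rank count and binary name are placeholders taken from your post; the `%q{OMPI_COMM_WORLD_RANK}` substitution assumes Open MPI):

```shell
# Profile each MPI rank, tracing CUDA, OpenACC, and MPI activity.
# The unified-memory rows in the timeline will show whether pages
# are migrating back and forth between host and device.
mpirun -np 4 nsys profile -t cuda,openacc,mpi \
    -o dew_rank%q{OMPI_COMM_WORLD_RANK} ./bin/dew
```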
The problem is likely not related to private memory allocation?
Likely, but again I don’t know for sure.
In general, I much prefer manually managing data rather than using UM for MPI codes. CUDA-aware MPI currently can't take advantage of GPUDirect communication when using UM. If possible, you may consider spending the time adding data directives, as well as host_data directives around your MPI calls. Granted, if the program doesn't do a lot of MPI communication, it may not matter, so it's up to you whether it's worth the time investment.
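A minimal sketch of what that can look like for a 1-D halo exchange. The array names, neighbor ranks, and stencil are illustrative, not from your code:

```fortran
subroutine halo_smooth(u, unew, n, left, right)
   use mpi
   implicit none
   integer, intent(in) :: n, left, right
   real(8) :: u(0:n+1), unew(0:n+1)
   integer :: i, ierr

   ! Structured data region: u stays resident on the GPU, no UM paging.
   !$acc data copy(u(0:n+1)) copyout(unew(1:n))

   ! host_data passes the *device* address of u to CUDA-aware MPI,
   ! so the halo exchange can use GPUDirect instead of staging
   ! through host memory.
   !$acc host_data use_device(u)
   call MPI_Sendrecv(u(n),   1, MPI_DOUBLE_PRECISION, right, 0, &
                     u(0),   1, MPI_DOUBLE_PRECISION, left,  0, &
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
   call MPI_Sendrecv(u(1),   1, MPI_DOUBLE_PRECISION, left,  1, &
                     u(n+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
   !$acc end host_data

   ! Placeholder stencil update using the freshly exchanged ghost cells.
   !$acc parallel loop present(u, unew)
   do i = 1, n
      unew(i) = 0.5d0 * (u(i-1) + u(i+1))
   end do

   !$acc end data
end subroutine halo_smooth
```

The same pattern extends to 2-D/3-D exchanges with packed halo buffers; the key point is that host_data only exposes device pointers, it doesn't move any data itself.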