I have a PGI + OpenACC code and I am trying to set it up to use unified memory.
I read online that, in order to activate unified memory, I have to set the flag “-ta=tesla:managed”.
Is that all, or do I also have to change my mallocs to cudaMallocManaged? And should I strip the copyin/copy/copyout clauses from my “#pragma acc” directives?
I ask because, after setting -ta=tesla:managed, nvprof still shows Host->Device and Device->Host copies of the data, with the same timings as before (for both the copies and the kernel).
Or could it be that nvprof somehow disables unified memory? (Sorry for the trivial question, I am very new to this.)
When you enable this option, the compiler replaces all of your malloc/new/allocate calls with the “managed” version. We use a managed pool allocator, so the code is not calling cudaMallocManaged directly, but the memory will still be managed.
And should I strip the copyin/copy/copyout clauses from my “#pragma acc” directives?
No need to do this. The compiler runtime will check if the variable is managed or not. If it is managed, then the data clause is essentially ignored.
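For example, here is a minimal sketch (hypothetical code, not your program). Compiled and linked with -ta=tesla:managed, the malloc calls become managed allocations and the existing data clauses can stay in place as harmless no-ops:

/* saxpy.c: hypothetical example */
#include <stdlib.h>

void saxpy(int n, float a, float *restrict x, float *restrict y)
{
    /* With x and y managed, copyin/copy are effectively ignored by the runtime. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));   /* becomes a managed (pool) allocation */
    float *y = malloc(n * sizeof(float));   /* becomes a managed (pool) allocation */
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 2.0f, x, y);
    free(x);
    free(y);
    return 0;
}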
nvprof still shows Host->Device and Device->Host copies of the data, with the same timings as before (for both the copies and the kernel).
Without specifics, it’s difficult to say exactly why this is happening. However, keep in mind that managed memory is currently only available for dynamically allocated data. So if your code is using fixed-size arrays or objects, those objects still need to be manually managed.
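A small illustration of that point (again a made-up sketch): the statically sized array still needs an explicit data clause, while the malloc’d array does not:

#define N 4096

float fixed[N];                  /* static storage: not managed, still needs a data clause */

void scale(float *restrict dyn)  /* dyn came from malloc, so it is managed */
{
    #pragma acc parallel loop copy(fixed)   /* explicit copy still required */
    for (int i = 0; i < N; ++i)
        fixed[i] *= 2.0f;

    #pragma acc parallel loop               /* no data clause needed for dyn */
    for (int i = 0; i < N; ++i)
        dyn[i] *= 2.0f;
}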
You also need to make sure that you link with “-ta=tesla:managed”; otherwise the runtime check for managed data isn’t used.
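For example, with separate compile and link steps (the file names here are made up), the flag appears in both:

pgcc -fast -ta=tesla:managed -c main.c -o main.o
pgcc -fast -ta=tesla:managed main.o -o app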
For managed memory, the profiler should have a row that shows the relative “heat” of the page migrations between the host and device. It won’t show the individual data copies like it does with the data regions.
Or could it be that nvprof somehow disables unified memory?
It’s enabled by default, but it is possible to disable it when you create a profiling session. You also need a device that is capable of supporting unified memory.
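If you’re using the command-line profiler, there is (as far as I recall) a switch that controls unified-memory profiling; something along these lines:

nvprof --unified-memory-profiling per-process-device ./app   # default: UM page migrations are shown
nvprof --unified-memory-profiling off ./app                  # UM profiling disabled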
I’m writing a program that runs on multiple GPUs in a single node. I want to use managed memory to allocate the variables on different GPUs and use standard parallel Fortran. Could you give an example for this situation?
CUDA Unified Memory (aka “managed”) works across multiple GPUs on a system (see: HERE), so you don’t really need to do anything special. This applies when you’re using a single process to manage multiple GPUs, such as via OpenMP or by programmatically changing devices.
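You asked for Fortran, but here is the idea as a minimal C/OpenACC sketch (the device switching uses the same acc_* routines you would call from Fortran; all names and sizes are made up). Because the array is managed, the same allocation is visible from every device the process uses:

#include <stdlib.h>
#include <openacc.h>

int main(void)
{
    int n = 1 << 20;
    int ndev = acc_get_num_devices(acc_device_nvidia);
    float *x = malloc(n * sizeof(float));   /* managed with -ta=tesla:managed */
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    /* One process driving several GPUs: switch devices explicitly.       */
    /* Since x is managed, no copyin/copyout is needed on any device.     */
    for (int d = 0; d < ndev; ++d) {
        acc_set_device_num(d, acc_device_nvidia);
        int chunk = n / ndev;
        int start = d * chunk;
        #pragma acc parallel loop
        for (int i = start; i < start + chunk; ++i)
            x[i] *= 2.0f;
    }
    free(x);
    return 0;
}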
Though I prefer to use MPI for multi-GPU programming, in which case each rank has its own CUDA context and therefore its own CUDA unified address space. So again, not a problem, but the UM is not shared across ranks.
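And a sketch of the MPI flavor I mentioned, using the usual rank-modulo-device-count mapping for a single node (again hypothetical code, not a definitive recipe):

#include <stdlib.h>
#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank binds to one GPU; each rank gets its own CUDA context */
    /* and therefore its own unified address space.                    */
    int ndev = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(rank % ndev, acc_device_nvidia);

    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));   /* managed, but only within this rank */
    for (int i = 0; i < n; ++i) x[i] = (float)rank;

    #pragma acc parallel loop               /* no data clauses needed */
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;

    free(x);
    MPI_Finalize();
    return 0;
}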