Overload when using managed memory flag in OpenACC


I am trying to use unified (managed) memory in my code. As a first step, I only added the flag at compile time, as shown below, without adding any pragma directives to the code.

-fast -acc -ta=tesla:managed -Minfo=accel -Mcuda -lnvToolsExt -O3 -Wall -std=c++0x -fPIC -I$(PROJECT_PATH)

I noticed that the GPUs become active and start doing transfers, and the code becomes very slow when compiled this way.

When I profile, I see transfers, but I don't know what is causing them.

What could be the reason for the problem, in your opinion?

Thanks for your time

A CUDA context will still be created, which is why you’re seeing the binaries in the nvidia-smi report.

As for the data movement, I’m not sure. Given that there are 8 processes, does the program use MPI? If so, it may be some initialization that MPI is doing in order to support CUDA-aware MPI. Or there might be some global parameters being implicitly copied.

If you set the environment variable “NV_ACC_NOTIFY=3”, does the output show any data movement?
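In case it helps to see what the value 3 selects: NV_ACC_NOTIFY is a bitmask, and as far as I know the documented bits are 1 (kernel launches), 2 (data transfers), 4 (region entry/exit), 8 (wait operations) and 16 (device memory allocates/deallocates), so 3 asks for launches plus transfers. Below is a minimal sketch that decodes the value a process actually sees. The function names `decode_acc_notify` and `read_acc_notify` are hypothetical helpers for illustration, not part of any NVIDIA API.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: lists the categories enabled by an NV_ACC_NOTIFY
// bitmask. Bit meanings are assumed from the NVIDIA HPC compiler docs.
std::vector<std::string> decode_acc_notify(unsigned mask) {
    std::vector<std::string> enabled;
    if (mask & 1u)  enabled.push_back("kernel launches");
    if (mask & 2u)  enabled.push_back("data transfers");
    if (mask & 4u)  enabled.push_back("region entry/exit");
    if (mask & 8u)  enabled.push_back("wait operations");
    if (mask & 16u) enabled.push_back("device memory allocates/deallocates");
    return enabled;
}

// Hypothetical helper: reads NV_ACC_NOTIFY from this process's environment.
// Returns 0 if the variable is unset or does not start with a number.
unsigned read_acc_notify() {
    const char* value = std::getenv("NV_ACC_NOTIFY");
    return value ? static_cast<unsigned>(std::strtoul(value, nullptr, 10)) : 0u;
}
```

Printing `decode_acc_notify(read_acc_notify())` from each rank at startup is one way to confirm that the variable really reached every MPI process, since some launchers may not forward the environment to all ranks by default.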


Thanks for your reply. Yes, the program uses MPI. I set NV_ACC_NOTIFY=3 and it didn’t show any data movement. I don’t know what is causing the data movement, and it’s taking a lot of time. Is there any other way to check what is causing this? I am fairly new to OpenACC, so it’s a bit confusing for me to work out what could be going on. Your help will be appreciated.


If it’s not showing up in the “NV_ACC_NOTIFY” output, then the data movement is not coming from the OpenACC runtime. I would suspect it’s coming from the MPI library.
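As a sanity check that NV_ACC_NOTIFY reports anything at all in your build, you could compile a tiny test with the same flags. Here is a minimal sketch, assuming a hypothetical function `scale` that is not from your code. It has one OpenACC compute region and no data clauses, relying on the managed flag to place the heap array in unified memory.

```cpp
#include <cstddef>

// Hypothetical test kernel: scales an array in place.
// There are no data clauses on purpose. With -ta=tesla:managed, heap
// allocations are assumed to live in CUDA unified memory, so the region
// can use the pointer directly. Without -acc the pragma is ignored and
// the loop simply runs on the host.
void scale(double* data, std::size_t n, double factor) {
    #pragma acc parallel loop
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
```

Calling `scale` on an array allocated with `new double[n]` and running with NV_ACC_NOTIFY=3 should report at least the kernel launch. As far as I understand, page migration for managed heap data is handled by the CUDA driver, so it would not necessarily be listed as an OpenACC data transfer.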

The next step would be to run the program through Nsight Systems, adding “--trace=cuda,openacc,mpi”. This will add details on the MPI communication and OpenACC API calls. Also, if you view the timeline in the GUI, you can see when the data movement is occurring, which might give clues as to where it is coming from.