I have a hybrid code with MPI + OpenMP + OpenACC using -ta=multicore.
When I run the code I see extremely low performance. The code does not contain nested OpenMP and OpenACC regions.
When I replace the OpenACC sections with OpenMP directives and remove -ta=multicore, I get the desired performance.
I suspect it is related to thread affinity, but I'm unable to figure it out. Do I need to set additional flags or environment variables?
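To make the structure concrete, here is a stripped-down sketch of the pattern (placeholder names, sizes, and loop bodies, not the real application):

```c
/* Sketch only: compiled along the lines of
 *   mpicc -mp -ta=multicore repro.c -o repro   (PGI/NVHPC wrappers)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000L

int main(int argc, char **argv)
{
    int rank;
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP region -- no OpenACC inside it */
    #pragma omp parallel for
    for (long i = 0; i < N; ++i)
        a[i] = rank + i * 0.5;

    /* Separate OpenACC region targeting multicore -- no OpenMP inside it */
    #pragma acc parallel loop
    for (long i = 0; i < N; ++i)
        b[i] = 2.0 * a[i];

    printf("rank %d: b[0] = %f\n", rank, b[0]);

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}
```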
I'll need to do a bit of research on this one. I haven't tried mixing OpenMP and OpenACC targeting multicore myself, and I'm not entirely sure how the two runtimes interact.
Also, I see you are using MKL, which could also be using OpenMP. Are you calling MKL from any OpenACC regions?
If at all possible, a reproducing example would be very welcome to ensure that I'm able to replicate and then investigate the issue here.
Thank you for your answer. I experimented a little and found that if I pass --bind-to socket to mpirun, it works correctly.
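For reference, the launch line now looks like this (Open MPI syntax; the rank count and binary name are placeholders), with --report-bindings added to check where the ranks actually land:

```
mpirun --bind-to socket --report-bindings -np 4 ./app
```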
No, MKL is not called from any OpenACC regions.
I have another question regarding ACC_NUM_CORES and OMP_NUM_THREADS. What is the relationship between these two variables?
When running a hybrid code, do users have to set ACC_NUM_CORES to the number of cores and OMP_NUM_THREADS to the number of threads? When ACC_NUM_CORES is not set, what default value is used?
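For context, this is roughly how I set things up at the moment (the counts and binary name are placeholders, and I'm not sure whether both variables are actually needed):

```
export OMP_NUM_THREADS=8   # threads for the OpenMP regions
export ACC_NUM_CORES=8     # cores for the OpenACC multicore runtime
mpirun --bind-to socket -np 2 ./app
```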