Thread bindings with OpenACC x86 and MPI

Hi,

Does the PGI OpenACC x86 runtime have any thread binding functionality for OpenACC threads created from MPI processes? I can use the binding functionality in OpenMPI to ensure each MPI process has enough cores for the threads it is going to spawn, but I’m unsure how to bind the OpenACC threads to those cores.
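For context, the kind of OpenMPI launch I have in mind is something like this (two ranks, one per socket; the exact flags are just an illustration):

mpirun -np 2 --bind-to socket --map-by socket ./a.out

which gives each rank a full socket’s worth of cores to work with.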

thanks

adrianj

Hi adrianj,

Yes, I’ve been frustrated with OpenMPI’s new hwloc binding mechanism as well. It doesn’t seem to do what I expect and instead binds every rank’s threads to the same cores. It might be pilot error, but I finally got to the point of disabling it and/or going back to using MPICH with scripts to set binding on a per-rank basis.

Our engineer who is responsible for building and packaging the 3rd-party applications we ship with the compilers, including OpenMPI, has a task to investigate and write up a FAQ or PGInsider article on the subject.

What I currently do is something like the following.

ACC_NUM_CORES=16 mpirun -np 2 --bind-to none ./a.out

Or use a script to bind. In this case, I’m using MPICH2’s PMI_RANK env var, but you could use OpenMPI’s OMPI_COMM_WORLD_RANK env variable if using OpenMPI (a sketch of that variant follows the script below). Of course, your logic will need to change as the number of ranks and/or nodes increases. Also, I’m using numactl to bind, but you could use other methods as well.

$ cat run.sh
export ACC_NUM_CORES=4
echo "$PMI_RANK"
if [ "$PMI_RANK" = 0 ]; then
    echo "CASE #1"
    numactl --physcpubind=0,1,2,3 ./$1
else
    echo "CASE #2"
    numactl --physcpubind=8,9,10,11 ./$1
fi
$ mpirun -np 2 sh run.sh a.out
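For reference, a rough OpenMPI version of the same wrapper (an untested sketch; run_ompi.sh is just an illustrative name, and --bind-to none keeps OpenMPI’s own binding from interfering with numactl) would be:

$ cat run_ompi.sh
export ACC_NUM_CORES=4
echo "$OMPI_COMM_WORLD_RANK"
if [ "$OMPI_COMM_WORLD_RANK" = 0 ]; then
    echo "CASE #1"
    numactl --physcpubind=0,1,2,3 ./$1
else
    echo "CASE #2"
    numactl --physcpubind=8,9,10,11 ./$1
fi
$ mpirun -np 2 --bind-to none sh run_ompi.sh a.out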
  • Mat

Hi,
I think I am experiencing similar issues… did you in the meantime find a solution for using OpenACC on multicore targets in conjunction with OpenMPI bindings?

In particular, when running on dual-socket machines, I would like to use two MPI processes, each bound to one socket, and then let each of them spawn enough threads to exploit all the cores of its CPU.

When running MPI+OpenMP applications with OpenMPI binding, I can successfully obtain this behavior by launching my application like this (e.g. for two 8-core CPUs):

export OMP_NUM_THREADS=8
mpirun -np 2 --bind-to socket --map-by socket --report-bindings ./main

and the reported bindings are exactly as wanted/expected:

MCW rank 0 bound to socket 0 ... : [B/B/B/B/B/B/B/B][./././././././.]
MCW rank 1 bound to socket 1 ... : [./././././././.][B/B/B/B/B/B/B/B]

While the application is running, using tools such as htop and taskset I can clearly see the two processes, bound to the two sockets respectively, each spawning 8 threads, with each thread running on its own core.
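For example, one way I spot-check this is to list the affinity of every thread of a rank with taskset, where <pid> is just a placeholder for that rank’s process id:

for tid in $(ls /proc/<pid>/task); do taskset -cp $tid; done

Each line of output then shows the CPUs that one thread is allowed to run on.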

Trying to do the same using an OpenACC code compiled with PGI for the multicore target:

export ACC_NUM_CORES=8
mpirun -np 2 --bind-to socket --map-by socket --report-bindings ./main

leads to the same result up to the bindings report, which looks correct, but then both processes end up bound to the cores of the same socket. The threads spawned by one MPI process are bound to different cores, but the other MPI process spawns the same number of threads and binds them to the very same cores used by the first process. The result is that both MPI processes always run on the same socket, with 2 threads bound to each core of that socket.
Thus, with ACC_NUM_CORES=8, just one socket is used and both MPI processes with all their threads share the same 8 cores.

Am I doing something wrong?

I think the source of this problem is that on a multi-socket machine acc_get_num_devices(acc_device_host) returns just one, even when there are two sockets; thus, when acc_set_device_num(0, acc_device_host) is called by the MPI processes, both of them use the same device number (i.e. 0) and get bound to the same cores.
Is it correct to use “acc_device_host” as device type?
Is it true that a call to acc_set_device_num overrides the bindings reported by OpenMPI?
If so, can this be disabled in some way in order to let the MPI library manage the bindings?
Are there any other solutions?


Thanks and Best Regards,

Enrico

I found one solution, which is to set the ACC_BIND environment variable (enabled by default when using -ta=multicore) to no:

export ACC_NUM_CORES=8
export ACC_BIND=no
mpirun -np 2 --bind-to socket --map-by socket --report-bindings ./main

This avoids overriding the OpenMPI bindings, so at least the processes and their respective threads run on the correct sockets… the threads still migrate within the socket (which is bad for cache data reuse), but it’s better than before.

p.s.
If there are other/better solutions, please let me know…


Best Regards,

Enrico