Thread bindings with OpenACC x86 and MPI

Hi,

Does the PGI OpenACC x86 runtime have any thread binding functionality for OpenACC threads created from MPI processes? I can use the binding functionality in OpenMPI to ensure each MPI process has enough cores for the threads it is going to spawn, but I’m unsure how to bind the OpenACC threads to those cores.
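For context, the kind of OpenMPI launch I have in mind is something like this (two ranks, one per socket; the exact flags are just an illustration):

mpirun -np 2 --bind-to socket --map-by socket ./a.out

which gives each rank a full socket’s worth of cores to work with.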

thanks

adrianj

Hi adrianj,

Yes, I’ve been frustrated with OpenMPI’s new hwloc binding mechanism as well. It doesn’t seem to do what I expect and instead binds every rank’s threads to the same cores. It might be pilot error, but I finally got to the point of disabling it and/or going back to using MPICH with scripts to set binding on a per-rank basis.

Our engineer who is responsible for building and packaging the 3rd-party applications we ship with the compilers, including OpenMPI, has a task to investigate and write up a FAQ or PGInsider article on the subject.

What I currently do is something like the following.

ACC_NUM_CORES=16 mpirun -np 2 --bind-to none ./a.out

Or use a script to bind. In this case, I’m using MPICH2’s PMI_RANK env var, but you could use OpenMPI’s OMPI_COMM_WORLD_RANK env variable if using OpenMPI (a sketch of that variant follows the script below). Of course, your logic will need to change as the number of ranks and/or nodes increases. Also, I’m using numactl to bind, but you could use other methods as well.

$ cat run.sh
export ACC_NUM_CORES=4
echo "$PMI_RANK"
if [ "$PMI_RANK" = 0 ]; then
    echo "CASE #1"
    numactl --physcpubind=0,1,2,3 ./$1
else
    echo "CASE #2"
    numactl --physcpubind=8,9,10,11 ./$1
fi
$ mpirun -np 2 sh run.sh a.out
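For reference, a rough OpenMPI version of the same wrapper (an untested sketch; run_ompi.sh is just an illustrative name, and --bind-to none keeps OpenMPI’s own binding from interfering with numactl) would be:

$ cat run_ompi.sh
export ACC_NUM_CORES=4
echo "$OMPI_COMM_WORLD_RANK"
if [ "$OMPI_COMM_WORLD_RANK" = 0 ]; then
    echo "CASE #1"
    numactl --physcpubind=0,1,2,3 ./$1
else
    echo "CASE #2"
    numactl --physcpubind=8,9,10,11 ./$1
fi
$ mpirun -np 2 --bind-to none sh run_ompi.sh a.out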
  • Mat

Hi,
I think I am experiencing similar issues… did you in the meantime find a solution for using OpenACC on multicore targets in conjunction with OpenMPI bindings?

In particular, when running on dual-socket machines, I would like to use two MPI processes, each bound to one socket, and then let each of them spawn enough threads to exploit all the cores of its CPU.

When running MPI+OpenMP applications with OpenMPI binding, I can successfully obtain this behavior by launching my application like this (e.g. for two 8-core CPUs):

export OMP_NUM_THREADS=8
mpirun -np 2 --bind-to socket --map-by socket --report-bindings ./main

and the reported bindings are exactly as wanted/expected:

MCW rank 0 bound to socket 0 ... : [B/B/B/B/B/B/B/B][./././././././.]
MCW rank 1 bound to socket 1 ... : [./././././././.][B/B/B/B/B/B/B/B]

While the application is running, using tools such as htop and taskset I can clearly see the two processes, bound to the two sockets respectively, each spawning 8 threads, with each thread running on its own core.
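For example, one way I spot-check this is to list the affinity of every thread of a rank with taskset, where <pid> is just a placeholder for that rank’s process id:

for tid in $(ls /proc/<pid>/task); do taskset -cp $tid; done

Each line of output then shows the CPUs that one thread is allowed to run on.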

Trying to do the same using an OpenACC code compiled with PGI for the multicore target:

export ACC_NUM_CORES=8
mpirun -np 2 --bind-to socket --map-by socket --report-bindings ./main

leads to the same result up to the bindings report, which looks correct, but then both processes end up bound to the cores of the same socket. The threads spawned by one MPI process are bound to different cores, but the other MPI process spawns the same number of threads and binds them to the very same cores used by the first process. The result is that both MPI processes always run on the same socket, with 2 threads bound to each core of that socket.
Thus, with ACC_NUM_CORES=8, just one socket is used and both MPI processes with all their threads share the same 8 cores.

Am I doing something wrong?

I think the source of this problem is that on a multi-socket machine acc_get_num_devices(acc_device_host) returns just one, even when there are two sockets; thus, when acc_set_device_num(0, acc_device_host) is called by the MPI processes, both of them use the same device number (i.e. 0) and get bound to the same cores.
Is it correct to use “acc_device_host” as device type?
Is it true that a call to acc_set_device_num overrides the bindings reported by OpenMPI?
If so, can this be disabled in some way in order to let the MPI library manage the bindings?
Are there any other solutions?


Thanks and Best Regards,

Enrico

I found one solution, which is to set the ACC_BIND environment variable (enabled by default when using -ta=multicore) to no:

export ACC_NUM_CORES=8
export ACC_BIND=no
mpirun -np 2 --bind-to socket --map-by socket --report-bindings ./main

This avoids overriding the OpenMPI bindings, so at least the processes and their respective threads run on the correct sockets… the threads still migrate within the socket (which is bad for cache data reuse), but it’s better than before.

p.s.
If there are other/better solutions, please let me know…


Best Regards,

Enrico