Hi,
I think I am experiencing similar issues… did you find, in the meantime, a solution for using OpenACC on multicore targets in conjunction with OpenMPI bindings?
In particular, when running on dual-socket machines, I would like to use two MPI processes, one bound to each socket, and then let each of them spawn enough threads to exploit all the cores of its CPU.
When running MPI+OpenMP applications with OpenMPI binding, I can obtain exactly this behavior by launching my application as follows (e.g. for two 8-core CPUs):
export OMP_NUM_THREADS=8
mpirun -np 2 --bind-to socket --map-by socket --report-bindings ./main
and the reported bindings are exactly as wanted/expected:
MCW rank 0 bound to socket 0 ... : [B/B/B/B/B/B/B/B][./././././././.]
MCW rank 1 bound to socket 1 ... : [./././././././.][B/B/B/B/B/B/B/B]
While the application is running, using tools such as htop and taskset I can clearly see the two processes, bound respectively to the two sockets, each spawning 8 threads, with each thread running on its own core.
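For reference, this is roughly the kind of minimal MPI+OpenMP test I use to check the placement (just an illustrative sketch, not my actual application; file name and build line are only examples):

```c
/* Minimal MPI+OpenMP placement check (illustrative sketch).
 * Built e.g. with: mpicc -fopenmp check_omp.c -o main
 */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each OpenMP thread reports the core it is currently running on. */
    #pragma omp parallel
    printf("rank %d, thread %d -> cpu %d\n",
           rank, omp_get_thread_num(), sched_getcpu());

    MPI_Finalize();
    return 0;
}
```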
Trying to do the same with an OpenACC code compiled with PGI for the multicore target:
export ACC_NUM_CORES=8
mpirun -np 2 --bind-to socket --map-by socket --report-bindings ./main
leads to the same results up to the binding report, which seems correct, but then the two processes end up on the same socket. The threads spawned by one MPI process are bound to different cores, while the other MPI process spawns the same number of threads and binds them to the same cores already used by the first process. The result is that the two MPI processes always run on the same socket, with 2 threads bound to each core of that socket.
Thus, if ACC_NUM_CORES=8 is set, only one socket is used and both MPI processes, with all their threads, share the same 8 cores.
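A stripped-down sketch of the kind of code I am running (names, sizes and the build line are just illustrative, not my real application):

```c
/* Minimal MPI+OpenACC multicore sketch (illustrative only).
 * Built e.g. with the PGI compiler through the OpenMPI wrapper:
 *   mpicc -acc -ta=multicore check_acc.c -o main
 */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 24)

static double a[N];

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Long-running compute region: with ACC_NUM_CORES=8 each rank spawns
     * 8 worker threads, whose placement I then watch with htop/taskset. */
    for (int iter = 0; iter < 1000; ++iter) {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = 0.5 * a[i] + (double)(i + rank);
    }

    printf("rank %d done, a[0] = %f\n", rank, a[0]);

    MPI_Finalize();
    return 0;
}
```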
Am I doing something wrong?
I think the source of this problem is that on a multi-socket machine acc_get_num_devices(acc_device_host) returns just 1, even when there are two sockets, so when acc_set_device_num(0, acc_device_host) is called by the MPI processes, both of them use the same device number (i.e. 0) and get bound to the same cores.
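To make the point concrete, this is the per-rank device selection I would expect to write (again only a sketch); since acc_get_num_devices(acc_device_host) returns 1 on my machine, every rank ends up selecting device 0:

```c
/* Sketch of the per-rank device selection I would like to use (illustrative). */
#include <mpi.h>
#include <openacc.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, ndev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* On my dual-socket machine this returns 1, not 2. */
    ndev = acc_get_num_devices(acc_device_host);

    /* Hence rank % ndev is always 0 and both ranks select the same "device". */
    acc_set_device_num(rank % ndev, acc_device_host);

    printf("rank %d: %d host device(s), selected device %d\n",
           rank, ndev, rank % ndev);

    MPI_Finalize();
    return 0;
}
```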
Is it correct to use “acc_device_host” as device type?
Is it true that a call to acc_set_device_num overrides the bindings reported by OpenMPI?
If that is the case, can this be disabled in some way so that the MPI library manages the bindings?
Are there any other solutions?
Thanks and Best Regards,
Enrico