Performance with hybrid setup

I was running some tests on CPU-only systems with v22.2 to compare a pure MPI setup against a hybrid one. However, the hybrid results are coming back much slower. The hybrid compilation adds the flags -mp -Minfo=mp -Mrecursive. The system is a chiplet design (from AMD), so my initial explanation was that the threads are forked such that the same MPI rank uses cores from different CCXs, but mapping by L3 cache didn’t improve performance.
Does the NVHPC Team have any reference or documentation showing how a hybrid setup with several threads per MPI rank can accelerate computations? My hope is that by looking at something that works, I can figure out what the problem is.
[Disclosure: I’ve tried a similar comparison with GCC on the same system and didn’t see this problem. As a matter of fact, the hybrid build ran a bit faster.]

Hi afernandez,

If I understand correctly, you have a hybrid MPI+OpenMP program with OpenMP targeting the host CPU, but you are seeing slower results than with a pure MPI version?

First, which MPI are you using? OpenMPI by default will use “--bind-to core”, which binds each MPI rank to a single core. Hence all of that rank’s OpenMP threads will be bound to the same core, causing a huge amount of contention. There are other binding strategies you can use, such as “--bind-to socket”, but personally, I always disable OpenMPI’s binding when running MPI+OpenMP via the “--bind-to none” mpirun option, and then either set “OMP_PROC_BIND=true” or use “numactl” to set the thread binding.

For AMD CPU systems, I’ve found that using multiple ranks, one per socket, and setting OMP_NUM_THREADS to the number of cores per socket is better for performance. Binding is a bit more challenging, but I usually use a wrapper script to set the bindings per rank, either with OMP_PLACES or the “numactl” utility.

For example, this is a Perl script I use for MPI+OpenACC targeting 8 GPUs. Feel free to modify it for your use case. For OpenMP you should only need to change the “core_map” to give a range of cores instead of a single core… I just run with “mpirun -np 8 --bind-to none perl …command…”.

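# Local rank on this node; assumes OpenMPI, which sets this variable for each rank.
my $rank = $ENV{OMPI_COMM_WORLD_LOCAL_RANK};
# Core that each local rank is pinned to (specific to this 8-GPU system).
my %core_map = (
  0=>'48', 1=>'56', 2=>'16', 3=>'24', 4=>'112', 5=>'120', 6=>'80', 7=>'88',
);
# NUMA node that each local rank allocates memory from.
my %mem_map = (
  0=>3, 1=>3, 2=>1, 3=>1, 4=>7, 5=>7, 6=>5, 7=>5,
);
my $core = $core_map{$rank};
my $mem = $mem_map{$rank};
my $cmd = "numactl -C $core -m $mem ";
# Append the application and its arguments to the numactl command line.
while (my $arg = shift) {
       $cmd .= "$arg ";
}
#print "$cmd\n";
# Run the application under numactl (assumed; the original post stops before this step).
system($cmd);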
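
For the OpenMP case, the same wrapper idea can be sketched with a core range per rank instead of a single core. The Python version below is only an illustration under stated assumptions: the function name numactl_command is hypothetical, cores are assumed to be numbered contiguously so that each local rank owns one block of them, and --localalloc stands in for the explicit memory map. Check the real core and NUMA layout (for example with lscpu) before relying on it.

```python
def numactl_command(local_rank, cores_per_rank, app_args):
    """Build a numactl command line that pins one rank to a block of cores.

    Assumption: cores are numbered contiguously, so local rank r owns
    cores r*cores_per_rank .. (r+1)*cores_per_rank - 1.
    """
    first = local_rank * cores_per_rank
    last = first + cores_per_rank - 1
    # -C restricts the process (and its OpenMP threads) to the given CPU range;
    # --localalloc allocates memory on the NUMA node the thread is running on.
    return ["numactl", "-C", f"{first}-{last}", "--localalloc", *app_args]


# Example: third rank on the node, 8 threads per rank.
print(" ".join(numactl_command(2, 8, ["./app"])))
# prints: numactl -C 16-23 --localalloc ./app
```

A wrapper would read the local rank from OMPI_COMM_WORLD_LOCAL_RANK (set by OpenMPI) and pass the resulting list to os.execvp, launched under “mpirun --bind-to none” as with the Perl script.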

Hope this helps,

Hi Mat,
I’m comparing hybrid versus MPI (as you mentioned, using pure OpenMP would be a disaster, but that is not the situation). The MPI wrapper is OpenMPI v4.1.2, and I’m setting OMP_PROC_BIND=TRUE, OMP_PLACES=cores and OMP_NUM_THREADS=8, which agrees with your suggestions. Yet jobs with this configuration are the ones that run pretty slowly.

Are you using multiple ranks? If so, OMP_PLACES=cores would have each rank bind to the same set of cores. Hence you’d want to set OMP_PLACES to the particular cores for each rank.
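
As a rough sketch of what a per-rank OMP_PLACES value could look like, the snippet below builds a place list for each local rank using the OpenMP interval notation, where “{8}:8” means eight single-CPU places starting at CPU 8. The function name omp_places_for_rank is hypothetical, and the calculation assumes contiguous CPU numbering with one hardware thread used per core, which may not match a given machine.

```python
def omp_places_for_rank(local_rank, threads_per_rank):
    """Return an OMP_PLACES string giving this rank its own block of CPUs.

    Assumption: CPUs are numbered contiguously and rank r should own
    CPUs r*threads_per_rank .. (r+1)*threads_per_rank - 1.
    """
    first = local_rank * threads_per_rank
    # "{first}:count" expands to count places: {first}, {first+1}, ...
    return f"{{{first}}}:{threads_per_rank}"


# Six ranks with eight threads each, as in the test described in this thread.
for rank in range(6):
    print(rank, omp_places_for_rank(rank, 8))
```

Each rank would export its own value (for example from a wrapper script that reads OMPI_COMM_WORLD_LOCAL_RANK) before starting the application, together with OMP_PROC_BIND=true.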

Also, are you setting “--bind-to none” on the mpirun command? Otherwise OpenMPI’s binding to a single core would take precedence.
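
One way to check which binding actually took effect is to print the CPU set each rank is allowed to run on. The following is a minimal diagnostic sketch, not an NVHPC or OpenMPI tool: the function names are hypothetical, os.sched_getaffinity is only available on Linux, and OMPI_COMM_WORLD_RANK is assumed to be set by OpenMPI's launcher. OpenMPI's own “--report-bindings” mpirun option gives similar information for the MPI-level binding.

```python
import os


def format_cpu_set(cpus):
    """Collapse CPU ids into a compact range string, e.g. {0, 1, 2, 5} -> '0-2,5'."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current run
        else:
            ranges.append([cpu, cpu])    # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def binding_report():
    """Describe the CPUs this process (and threads it spawns) may run on."""
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    get_affinity = getattr(os, "sched_getaffinity", None)  # Linux-only call
    if get_affinity is None:
        return f"rank {rank}: affinity query not supported on this platform"
    cpus = get_affinity(0)
    return f"rank {rank}: {len(cpus)} CPU(s) allowed: {format_cpu_set(cpus)}"


if __name__ == "__main__":
    print(binding_report())
```

If a rank reports a single allowed CPU while OMP_NUM_THREADS is 8, all eight threads are sharing one core, which would match the contention described above.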



Hi Mat,
Yes, I’m running tests with 6 MPI ranks (and 8 threads per rank), but the ultimate objective would be to run with hundreds of ranks. I’ve tried and tried, but something is just not clicking. When I tried with --bind-to none, it unexpectedly crashed with the error message:

mpirun noticed that process rank 0 with PID 0 on node ip-172-31-15-253 exited on signal 6 (Aborted).


Odd. I can’t think of any reason why not applying binding in mpirun would cause an abort. Maybe it exposed an issue in the application, perhaps a timing issue now that the threads are able to run concurrently?

OK. I don’t have an explanation either. I’ll wait for the next release and try again, as I cannot afford to spend more time on this right now.