OpenMP thread affinity

Hello,

I am looking for an easy alternative to the KMP_AFFINITY flags from Intel OpenMP, which allow one to customize the mapping of threads to CPU cores. PGI's MP_BIND and MP_BLIST are not very customizable in that respect: one has to feed in the core IDs manually, as opposed to the KMP_AFFINITY keywords that do this for us.
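
For comparison, here is roughly what the two interfaces look like side by side; `compact` and `scatter` are standard KMP_AFFINITY placement types, while the MP_BLIST core IDs below are just an example for one 8-core socket:

```shell
# Intel OpenMP: placement chosen by keyword, core IDs derived automatically
export KMP_AFFINITY="granularity=fine,compact"   # or: scatter, explicit, ...

# PGI OpenMP: the core list has to be spelled out by hand
export MP_BIND=yes
export MP_BLIST="0,1,2,3,4,5,6,7"                # example: the cores of socket 0
```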

In my particular case, I am looking for a fairly simple mapping: 2 MPI processes per dual-socket node, one process per socket, and then OpenMP within that socket. It turns out that if I bind the OpenMP threads to cores, my application gets about a 20% speedup, so that's not insignificant.
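
For what it's worth, on compilers that support the OpenMP 4.0 affinity controls (GNU, for instance), this per-socket layout can often be expressed without any core-ID arithmetic. This is only a sketch; the thread count assumes 8 cores per socket, and the application name is a placeholder:

```shell
# Standard OpenMP affinity variables instead of an explicit core list
export OMP_NUM_THREADS=8     # assumed: 8 cores per socket
export OMP_PROC_BIND=close   # pack threads onto neighbouring places
export OMP_PLACES=cores      # one place per physical core

# let SLURM hand each of the 2 ranks its own socket
srun -n 2 --ntasks-per-node=2 --cpu-bind=sockets ./my_app
```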

After a fruitless web search I ended up spending a couple of hours writing a script that produces the MP_BLIST for each socket. I have to admit I was surprised not to find such a script already, since the GNU compilers and taskset use the same explicit-core-list strategy.

The matter is further complicated by the fact that some NUMA systems number the CPUs sequentially within a socket and some round-robin across sockets (as can be seen from "numactl -H").
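To illustrate with made-up "numactl -H" output (the variable names here exist only for the example), the two layouts differ in the second CPU listed for node 0, which is exactly what the script below tests:

```shell
# Two possible "node 0 cpus" lines from numactl -H (sample data):
#   sequential:  node 0 cpus: 0 1 2 3 4 5 6 7
#   round-robin: node 0 cpus: 0 2 4 6 8 10 12 14
line="node 0 cpus: 0 2 4 6"   # substitute: numactl -H | grep "node 0 cpus"
cpus=(${line#*:})             # split the CPU IDs into a bash array
if [ "${cpus[1]}" = "1" ]; then
  echo "NUMA core mapping is sequential"
else
  echo "NUMA core mapping is round-robin"
fi
```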

I would appreciate 2 things:

  1. If you don’t have a script like this, please scrutinize the one I paste below and post it somewhere, as I figure it would be useful to many. My script was tested only for 2 tasks per node (taken from the SLURM_STEP_TASKS_PER_NODE variable), but it should be fairly easy to extend to more tasks per node.

  2. If you already have something like this (arguably better), please post it somewhere where one can find it easily.

If you have any other feedback that could be useful, I’d appreciate hearing it.

Here’s the script:

#!/bin/bash

# determine procs/threads per node
TPN=$(echo "$SLURM_STEP_TASKS_PER_NODE" | cut -f 1 -d \()  # strip any "(xN)" repeat suffix
PPN=$(grep -c ^processor /proc/cpuinfo)                    # logical CPUs per node
NTHR=$((PPN/TPN))                                          # threads per MPI task

#determine NUMA mapping
MAP=($(numactl -H | grep "node 0 cpus" | cut -d : -f 2))
if [ "${MAP[0]}" = "0" ] && [ "${MAP[1]}" = "1" ]; then
 if (( $PMI_RANK == 0 )); then  # PMI_RANK is set by the MPI launcher (use SLURM_PROCID under plain srun)
   echo NUMA core mapping is sequential
 fi
 NUMAMAP=0
else
 if (( $PMI_RANK == 0 )); then
   echo NUMA core mapping is round-robin
 fi
 NUMAMAP=1
fi

SPACE=","
CTR=1
# create CORELIST mapping, first for round-robin NUMA core mapping
if (( $NUMAMAP == 1 )); then
  if (( $PMI_RANK % 2 == 0 )); then
    CORE=0
  else
    CORE=1
  fi
  CORELIST=$CORE
  let CORE=CORE+2
  while [ $CTR -lt $NTHR ]; do
    CORELIST=$CORELIST$SPACE$CORE
    let CTR=CTR+1
    let CORE=CORE+2
  done
else # here's CORELIST for sequential NUMA core mapping
  NTHR2=$((NTHR/2)) # assume hyperthreading is on
  if (( $PMI_RANK % 2 == 0 )); then
    CORE=0
  else
    CORE=$NTHR2
  fi
  CORELIST=$CORE
  let CORE=CORE+1
  while [ $CTR -lt $NTHR2 ]; do  # first fill the first hypercore set
    CORELIST=$CORELIST$SPACE$CORE
    let CTR=CTR+1
    let CORE=CORE+1
  done
  let CORE=CORE+NTHR2
  CTR=0
  while [ $CTR -lt $NTHR2 ]; do  # now fill the second hypercore set
    CORELIST=$CORELIST$SPACE$CORE
    let CTR=CTR+1
    let CORE=CORE+1
  done
fi

# set the PGI environment variables and run whatever needs to be run
export OMP_PROC_BIND=true  # MP_BIND=yes is the PGI equivalent of OMP_PROC_BIND
export MP_BLIST=$CORELIST
echo MPI rank $PMI_RANK maps to cores $CORELIST
"$@"  # run whatever was passed in, preserving quoted arguments
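
Assuming the script is saved as, say, bind_blist.sh (a name I am making up here) and made executable, it simply wraps the real command at launch time:

```shell
# Hypothetical launch: 2 nodes, 2 ranks per node; the application name is a placeholder
chmod +x bind_blist.sh
srun -N 2 --ntasks-per-node=2 ./bind_blist.sh ./my_app
```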

I have filed TPR 23407 to request this capability be evaluated for possible
inclusion in the future.

dave