PGI 6.0 on AMD64: numa and numactl?

I am using PGI 6.0-5 compilers and CDK on AMD64 platform,
under a Linux kernel 2.6.9 (Red Hat 4.0 derived).
As mentioned in my previous messages on this forum, I am
seeing some performance issues when comparing results
against the older 2.4 kernel.
Since the codes I am testing are MPI-based rather than
OpenMP or autopar, the compiler's numa option
is ineffective, if I understand the release notes correctly.
Therefore, I would like to test the numactl tools and
library in order to squeeze more performance
from the system (HW and SW stack).

Do you have any suggestion regarding the usage
of numactl together with PGI 6.0 on AMD64 architecture?
Are they largely independent, so that I can safely
experiment with numactl options, or do they
interact in some subtle way?



Hi cmn,

The numactl tool is independent of the compilers. However, you can use the “-mp=numa” flag during linking (even for non-OpenMP programs) to have your program linked with the system’s NUMA libraries. Either method, numactl or “-mp=numa”, will have mostly the same effect on your program. The main difference is how you’ll run your program.

With numactl, you simply set the environment variable ‘NCPUS’ to the number of threads to run and then launch your program with ‘numactl myprog.exe’. The “-c n” flag specifies which node(s) to run on. In a multi-CPU system, 1 node equals 1 CPU; in a dual-core system, 1 node equals 2 CPUs. If you need finer-grained control on dual-core systems, you’ll also need to use ‘taskset’ in order to pin to a specific CPU. Numactl’s “-m n” flag indicates which node to lock your program’s memory to. Typically, this is the same node(s) that you specified with “-c”. Instead of locking the memory to a particular node, you can specify “--interleave” to have the memory interleaved across all available nodes. This can help memory-bound codes that need a lot of throughput, but in general you should lock the memory to a node.
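Putting those flags together, a typical invocation might look like the sketch below. This is only an illustration: ‘myprog.exe’ is a placeholder for your own executable, and the node/CPU numbers assume a two-node system, so adjust them to your hardware.

```shell
# Run the program pinned to node 0 ("-c 0"), with its memory
# locked to the same node ("-m 0").
export NCPUS=2                      # number of threads for the runtime
numactl -c 0 -m 0 ./myprog.exe

# Alternative for memory-bound codes: interleave pages across all nodes
# instead of locking them to one node.
numactl --interleave=all ./myprog.exe

# Finer-grained pinning on a dual-core system: combine numactl's memory
# placement with taskset's CPU mask (0x1 = CPU 0 only).
numactl -m 0 taskset 0x1 ./myprog.exe
```

Note that these commands require a NUMA-capable kernel and the numactl/taskset utilities to be installed.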

If you’re using “-mp=numa”, you set NCPUS as before, but instead use the environment variables “MP_BIND” and “MP_BLIST” to tell the runtime whether the program should be bound and which CPUs to bind it to. The syntax is “MP_BIND=yes|no” to indicate whether the program should be bound (the default is no), and “MP_BLIST=0,1,2,…” lists which CPUs the threads should be bound to. There is no equivalent to “--interleave”. More on this can be found in the 6.0 release notes.
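For comparison, the equivalent run for a program linked with “-mp=numa” uses only environment variables. Again a sketch, with ‘myprog.exe’ as a placeholder and a two-thread run assumed:

```shell
# Same binding as the numactl example, but driven by the PGI runtime
# for an executable linked with -mp=numa.
export NCPUS=2        # number of threads to run
export MP_BIND=yes    # bind threads to CPUs (the default is no)
export MP_BLIST=0,1   # thread 0 on CPU 0, thread 1 on CPU 1
./myprog.exe
```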

My experience using numactl has largely been with auto-parallelization (-Mconcur) running on a quad dual-core system (8 CPUs). So unfortunately, I don’t have much advice for running with MPI, other than to say I don’t think either method would help, since both are for use on a single multi-CPU system rather than a cluster. If you went with a hybrid model (MPI/OpenMP) then there might be some use, but other than that, I doubt it.

  • Mat