I am currently investigating a performance discrepancy between binaries compiled with GNU 7.4 and NVIDIA HPC SDK 21: the binary built with the NVIDIA HPC SDK is around 40% slower on an Intel Skylake system. After some trial and error I have determined that the discrepancy comes from the “-tp=px” option that we are using to ensure backwards compatibility with older CPU architectures.
To improve performance on newer architectures while maintaining support for older systems, I attempted to pass multiple values to -tp (-tp=px,skylake,zen). The older PGI 19.X documentation suggests this should be supported, but I can’t find the same text in the NVIDIA HPC SDK documentation, and it seems to be unsupported:
pgc++-Fatal-Switch -tp can only have one keyword value
Is the unified binary approach still supported? What is the current best practice for generating a single binary that supports older CPUs while maintaining competitive performance on newer architectures?