Unified binary creation with -tp and NVIDIA HPC SDK 21.2

Hello,

I am currently investigating a performance discrepancy between binaries compiled with GNU 7.4 and the NVIDIA HPC SDK 21.2, with the NVIDIA HPC SDK binary being around 40% slower on an Intel Skylake system. After some trial and error I have determined that the discrepancy comes from the “-tp=px” option that we use to ensure backwards compatibility with older CPU architectures.

To improve performance on newer architectures while maintaining support for older systems, I attempted to pass multiple values to -tp (-tp=px,skylake,zen). The older PGI 19.x documentation suggests this should be supported, but I can’t find the same text in the NVIDIA HPC SDK documentation, and it seems to be unsupported:

pgc++-Fatal-Switch -tp can only have one keyword value
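For reference, roughly the invocation that triggers it (the file and output names here are placeholders):

pgc++ -O2 -tp=px,skylake,zen -o app main.cpp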

Is the unified binary approach still supported? What is the current best practice for generating a single binary that supports older CPUs while maintaining competitive performance on newer architectures?

Thanks,

-David

Unfortunately no. This went away when we moved to the LLVM-based back-end compiler. Hopefully we can bring it back at some point, but it wasn’t a widely used feature, so it’s not a high-priority item.

What is the current best practice for generating a single binary that supports older CPUs while maintaining competitive performance on newer architectures?

It depends on how far back you need to support. You’d want to target the oldest CPU you think your end-users would have, something like -tp=sandybridge or -tp=haswell.
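For example (a sketch; the optimization level and file names are placeholders):

# Build for the oldest CPU generation expected in the field;
# newer CPUs still run this code, just without their newest instructions.
nvc++ -O2 -tp=sandybridge -o app main.cpp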

Hi Mat,

Thanks for the quick response. Indeed, right now I’m building with a few different -tp options to see how it performs. I can’t imagine any of our customers using something older than Haswell, but I’ve been surprised before…

I’ll let you know if I run into any other problems,

Thanks again,

-David

Hello Mat,

Thanks for the tips. I ended up using -tp=sandybridge, which should give us sufficient backwards compatibility while matching the GNU build’s speed.

I have a related follow-up about packaging for multiple CUDA versions. We absolutely want to support CUDA 11 / Ampere hardware, but at the same time I know that most of our users are still running CUDA 10. I encounter failures at runtime if I build on a system with CUDA 11 and then try to run on a system with CUDA 10. Is it possible to support multiple versions of CUDA with a single binary?

Thanks,

-David

Unfortunately, no. You can target multiple devices within the same binary, but they need to use the same base CUDA version. However, Ampere-based targets require CUDA 11.
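For example, assuming an OpenACC build (the -gpu sub-options are the relevant part; everything else is a placeholder):

# One binary covering Volta (cc70) and Turing (cc75), built against CUDA 10.2:
nvc++ -acc -gpu=cc70,cc75,cuda10.2 -o app_cuda10 main.cpp

# A second binary adding Ampere (cc80), which requires CUDA 11:
nvc++ -acc -gpu=cc70,cc75,cc80,cuda11.0 -o app_cuda11 main.cpp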

Hi Mat,

Thanks for the response. To make sure I’m really clear on this: if we want to support both Ampere systems running CUDA 11 and Volta systems running CUDA 10, there is no choice but to package duplicate binaries?

If that is the case, I’ll have to bring this up with our product managers right away, as we have an upcoming release that will have to be adapted. I don’t think our customers will be willing to upgrade to the latest CUDA; they are chronically slow to upgrade.

Thanks,

-David

Hi Mat,

Building on my last comment: I suppose there is no way to include a stripped-down CUDA 11 runtime with our package that would supersede an older CUDA installation on the user’s system?

Thanks,

-David

If your customers are unwilling to update their CUDA drivers, then yes. You can redistribute the CUDA runtime libraries with your package, but the CUDA driver is part of the system’s NVIDIA driver installation and needs to be new enough to support CUDA 11.
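If you do end up shipping two builds, a small launcher along these lines could pick the right one at run time (a sketch; the binary names and the 450-series driver cutoff for CUDA 11 are assumptions to verify against the CUDA release notes):

#!/bin/sh
# Choose the binary matching the installed NVIDIA driver.
# CUDA 11 requires roughly a 450-series or newer driver.
ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
major=${ver%%.*}
if [ "${major:-0}" -ge 450 ]; then
    exec ./app_cuda11 "$@"
else
    exec ./app_cuda10 "$@"
fi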

Thanks Mat.

Definitely better to know this now than a few months from now when the customers get ahold of it.

-David