Unified binary creation with -tp and NVIDIA HPC SDK 21.2

Hello,

I am currently investigating a performance discrepancy between binaries compiled with GNU 7.4 and the NVIDIA HPC SDK 21.2, with the NVIDIA HPC SDK binary being around 40% slower on an Intel Skylake system. After some trial and error I have determined that the discrepancy comes from the “-tp=px” option that we use to ensure backwards compatibility with older CPU architectures.

To improve performance on newer architectures while maintaining support for older systems, I attempted to pass multiple values to -tp (-tp=px,skylake,zen). The older PGI 19.x documentation suggests this should be supported, but I can’t find the same text in the NVIDIA HPC SDK documentation, and it appears to be unsupported:

pgc++-Fatal-Switch -tp can only have one keyword value

Is the unified binary approach still supported? What is the current best practice for generating a single binary that supports older CPUs while maintaining competitive performance on newer architectures?

Thanks,

-David

Unfortunately no. This went away when we moved to the LLVM-based back-end compiler. Hopefully we can bring it back at some point, but it wasn’t a widely used feature, so it’s not a high-priority item.

What is the current best practice for generating a single binary that supports older CPUs while maintaining competitive performance on newer architectures?

It depends on how far back you need to support. You’d want to target the oldest CPU you think your end users would have, with something like -tp=sandybridge or -tp=haswell.

Hi Mat,

Thanks for the quick response. Indeed, right now I’m building with a few different -tp options to see how they perform. I can’t imagine any of our customers using something older than Haswell, but I’ve been surprised before…

I’ll let you know if I run into any other problems,

Thanks again,

-David

Hello Mat,

Thanks for the tips. I ended up using -tp=sandybridge, which should give us sufficient backwards compatibility while matching the GNU speed.

I have a related follow-up about packaging for multiple CUDA versions. We absolutely want to support CUDA 11 / Ampere hardware, but at the same time I know that most of our users are still running CUDA 10. I encounter failures at runtime if I build on a system with CUDA 11 and then try to run on a system with CUDA 10. Is it possible to support multiple versions of CUDA with a single binary?

Thanks,

-David

Unfortunately, no. You can target multiple devices within the same binary, but you need to use the same base CUDA version, and Ampere-based targets require CUDA 11.
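For reference, a hypothetical build line for such a multi-device binary, following the `-gpu=` flag spelling Mat uses later in this thread (the source file name is illustrative):

```shell
# One fat binary covering Volta (cc70) and Ampere (cc80). Both entries
# must share a single base CUDA version, and cc80 forces that to be 11.x,
# which is why a CUDA 10 system cannot also be covered.
nvc++ -acc -gpu=cuda11.0,cc70,cc80 -tp=sandybridge solver.cpp -o solver
```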

Hi Mat,

Thanks for the response. To make sure I’m really clear on this: if we want to support both Ampere systems running CUDA 11 and Volta systems running CUDA 10, is there no choice but to package duplicate binaries?

If that is the case I’ll have to bring this up with our product managers quickly, as we have an upcoming release that will have to be adapted. I don’t think our customers will be willing to upgrade to the latest CUDA; they are chronically slow to upgrade.

Thanks,

-David

Hi Mat,

Building on my last comment: I suppose there is no way to include a stripped-down CUDA 11 runtime with our package that would supersede an older CUDA installation on the user’s system?

Thanks,

-David

If your customers are unwilling to update their CUDA drivers, then yes, that’s correct.

Thanks Mat.

Definitely better to know this now than a few months from now when the customers get ahold of it.

-David

Hello Mat,

Sorry to revive this old thread. I’m seeing some odd behavior with one of our release builds that I just don’t understand. This is related to the CUDA 10 / 11 discussion we had a few weeks ago.

For this application we build with a slightly older version of the PGI compiler, 19.4, on a system without a native CUDA installation. PGI thus defaults to linking with CUDA 10.1, which is what I expected. This application runs on every system I have tried with a CUDA 10.x installation (sample nvidia-smi output from one of these systems shown below):

| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |

However, when I try to run on a system with CUDA 11 it fails at the first OpenACC API call, which unsurprisingly is acc_get_num_devices(acc_device_nvidia). The system where it is failing is using driver version 450.102 and CUDA version 11.0:

| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |

Everything I’ve read online seems to indicate that applications built with an older version of CUDA should be compatible with newer CUDA drivers. Am I missing something?

Thanks for your help

-David

No, you’re not missing anything: CUDA drivers are backwards compatible, so I would expect a binary built against CUDA 10.2 to run with a CUDA 11.0 driver.

While I don’t have these exact driver versions, I tried building one of my programs on a CUDA 10.2 system (440.33.01) with “-gpu=cuda10.2,cc60,cc70 -tp penryn” using a P100, then ran it successfully on a V100 system with CUDA 11.0 (450.51.06).

What’s the actual error? Could it be failing due to a missing dependent library?

Hello Mat,

That is the weird part: the only error is “libgomp: TODO”, which means nothing to me. The exact code being called by the solver is:

    std::cout << "DAVID - BEFORE..." << std::endl;
    int nbAttachedDevices = acc_get_num_devices(acc_device_nvidia);
    std::cout << "DAVID - AFTER..." << std::endl;

…and the output from the solver, running in serial, before returning to the terminal is as follows. There is no other output on stderr or stdout.

    DAVID - BEFORE...
    libgomp: TODO

I just encountered another useful piece of information. If I add …/compilers/lib/ from a newer NVIDIA HPC SDK installation to LD_LIBRARY_PATH, I can bypass the issue. Running ldd on my solver before and after adding the extra path to LD_LIBRARY_PATH, I see that the following libraries are taken from the 21.2 installation, and the dependency on libgomp is now gone. To be clear, we do not use OpenMP in our solver, so I’m not sure why this dependency is there at all.

    libcudadevice.so => /common/pgi/Linux_x86_64/21.2/compilers/lib/libcudadevice.so (0x00007fd393f14000)
    libomp.so => /common/pgi/Linux_x86_64/21.2/compilers/lib/libomp.so (0x00007fd392151000)
    libpgmath.so => /common/pgi/Linux_x86_64/21.2/compilers/lib/libpgmath.so (0x00007fd391d30000)
    libpgc.so => /common/pgi/Linux_x86_64/21.2/compilers/lib/libpgc.so (0x00007fd391adc000)

I’ll keep trying to figure out what is going on. If the long-term solution is to package some newer runtime libs with our solver, that is fine with me as long as I can prove it to be stable.

Thanks,

-David

OK, so what’s probably happening is that “libgomp” includes the OpenACC API symbols and appears before the NVHPC OpenACC runtime library in the loader’s search order, so the loader is picking up these symbols from libgomp. That is why adjusting LD_LIBRARY_PATH to change the order in which the loader scans the directories makes the code work.

As to why libgomp is a dependency in the first place, I’m not sure. You may want to look at your application’s link line to see what is pulling it in; it could also be something like the MPI library being used.

Hi Mat,

Yes, I have confirmed what you suspected. I found that the easiest way to bypass the problem on my end is to use the libgomp from the NVIDIA HPC SDK package while using all the other OpenACC libraries from PGI 19. In other words, just adding the following to my launch scripts has resolved the problem:

LD_PRELOAD=/common/pgi/Linux_x86_64/21.2/compilers/lib/libgomp.so

After some additional investigation, it seems the gomp dependency comes from a library that was recently added to our package, which explains why I never saw this behavior until now.

Thanks again for your help, I think I have all that I need to close this issue out for now.

-David