Unified binary creation with -tp and NVIDIA HPC SDK 21.2

Hello,

I am currently investigating a performance discrepancy between binaries compiled with GNU 7.4 and the NVIDIA HPC SDK 21.2, with the NVIDIA HPC SDK binary being around 40% slower on an Intel Skylake system. After some trial and error I have determined that the discrepancy comes from the “-tp=px” option that we use to ensure backwards compatibility with older CPU architectures.

To improve performance on newer architectures while maintaining support for older systems, I attempted to pass multiple values to -tp (-tp=px,skylake,zen). The older PGI 19.x documentation suggests this should be supported, but I can’t find the same text in the NVIDIA HPC SDK documentation, and it appears to be unsupported:

pgc++-Fatal-Switch -tp can only have one keyword value

Is the unified binary approach still supported? What is the current best practice for generating a single binary that supports older CPUs while maintaining competitive performance on newer architectures?

Thanks,

-David

Unfortunately no. This went away when we moved to the LLVM-based back-end compiler. Hopefully we can bring it back at some point, but it wasn’t a widely used feature, so it’s not a high-priority item.

What is the current best practice for generating a single binary that supports older CPUs while maintaining competitive performance on newer architectures?

It depends on how far back you need to support. You’d want to target the oldest CPU you think your end users would have, with something like -tp=sandybridge or -tp=haswell.

Hi Mat,

Thanks for the quick response. Indeed, right now I’m building with a few different -tp options to see how they perform. I can’t imagine any of our customers using something older than Haswell, but I’ve been surprised before…

I’ll let you know if I run into any other problems,

Thanks again,

-David

Hello Mat,

Thanks for the tips. I ended up using -tp=sandybridge, which should give us sufficient backwards compatibility while matching the GNU speed.

I have a related follow-up about packaging for multiple CUDA versions. We absolutely want to support CUDA 11 / Ampere hardware, but at the same time I know that most of our users are still running CUDA 10. I encounter failures at runtime if I build on a system with CUDA 11 and then try to run on a system with CUDA 10. Is it possible to support multiple versions of CUDA with a single binary?

Thanks,

-David

Unfortunately, no. You can target multiple devices within the same binary, but you need to use the same base CUDA version, and Ampere-based targets require CUDA 11.
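For reference, a hypothetical build line for such a multi-device binary, following the `-gpu=` flag spelling Mat uses later in this thread (the source file name is illustrative):

```shell
# One fat binary covering Volta (cc70) and Ampere (cc80). Both entries
# must share a single base CUDA version, and cc80 forces that to be 11.x,
# which is why a CUDA 10 system cannot also be covered.
nvc++ -acc -gpu=cuda11.0,cc70,cc80 -tp=sandybridge solver.cpp -o solver
```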

Hi Mat,

Thanks for the response. To make sure I’m really clear on this: if we want to support both Ampere systems running CUDA 11 and Volta systems running CUDA 10, is there no choice but to package duplicate binaries?

If that is the case I’ll have to bring this up with our product managers quickly, as we have an upcoming release that will have to be adapted. I don’t think our customers will be willing to upgrade to the latest CUDA; they are chronically slow to upgrade.

Thanks,

-David

Hi Mat,

Building on my last comment: I suppose there is no way to include a stripped-down CUDA 11 runtime with our package that would supersede an older CUDA installation on the user’s system?

Thanks,

-David

If your customers are unwilling to update their CUDA drivers, then yes, that’s correct.

Thanks Mat.

Definitely better to know this now than a few months from now when the customers get ahold of it.

-David

Hello Mat,

Sorry to revive this old thread. I’m seeing some odd behavior with one of our release builds that I just don’t understand. This is related to the CUDA 10 / 11 discussion we had a few weeks ago.

For this application we build with a slightly older version of the PGI compiler, 19.4, on a system without a native CUDA installation. PGI thus defaults to linking with CUDA 10.1, which is what I expected. This application runs on every system I have tried with a CUDA 10.x installation (sample nvidia-smi output from one of these systems shown below):

| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |

However, when I try to run on a system with CUDA 11 it fails at the first OpenACC API call, which unsurprisingly is acc_get_num_devices(acc_device_nvidia). The system where it is failing is using driver version 450.102 and CUDA version 11.0:

| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |

Everything I’ve read online seems to indicate that applications built with an older version of CUDA should be compatible with newer CUDA drivers. Am I missing something?

Thanks for your help

-David

No, you’re not missing anything: CUDA drivers are backwards compatible, so I would expect a binary built against CUDA 10.2 to run with a CUDA 11.0 driver.

While I don’t have these exact driver versions, I tried building one of my programs on a CUDA 10.2 system (440.33.01) with “-gpu=cuda10.2,cc60,cc70 -tp penryn” using a P100, then ran it successfully on a V100 system with CUDA 11.0 (450.51.06).

What’s the actual error? Could it be failing due to a missing dependent library?

Hello Mat,

That is the weird part: the only error is “libgomp: TODO”, which means nothing to me. The exact code being called by the solver is:

    std::cout << "DAVID - BEFORE..." << std::endl;
    int nbAttachedDevices = acc_get_num_devices(acc_device_nvidia);
    std::cout << "DAVID - AFTER..." << std::endl;

…and the output from the solver, running in serial, before returning to the terminal is as follows. There is no other output on stderr or stdout.

    DAVID - BEFORE...
    libgomp: TODO

I just encountered another useful piece of information. If I add …/compilers/lib/ from a newer NVIDIA HPC SDK installation to LD_LIBRARY_PATH, I can bypass the issue. Running ldd on my solver before and after adding the extra path to LD_LIBRARY_PATH, I see that the following libraries are taken from the 21.2 installation, and the dependency on libgomp is now gone. To be clear, we do not use OpenMP in our solver, so I’m not sure why this dependency is there at all.

    libcudadevice.so => /common/pgi/Linux_x86_64/21.2/compilers/lib/libcudadevice.so (0x00007fd393f14000)
    libomp.so => /common/pgi/Linux_x86_64/21.2/compilers/lib/libomp.so (0x00007fd392151000)
    libpgmath.so => /common/pgi/Linux_x86_64/21.2/compilers/lib/libpgmath.so (0x00007fd391d30000)
    libpgc.so => /common/pgi/Linux_x86_64/21.2/compilers/lib/libpgc.so (0x00007fd391adc000)

I’ll keep trying to figure out what is going on. If the long-term solution is to package some newer runtime libs with our solver, that is fine with me as long as I can prove it to be stable.

Thanks,

-David

OK, so what’s probably happening is that “libgomp” includes the OpenACC API symbols and appears before the NVHPC OpenACC runtime library in the loader’s search order, so the loader is picking up these symbols from libgomp. That is why adjusting LD_LIBRARY_PATH to change the order in which the loader scans the directories makes the code work.

As to why libgomp is a dependency in the first place, I’m not sure. You may want to look at your application’s link line to see what is pulling it in; it could also be something like the MPI library being used.

Hi Mat,

Yes, I have confirmed what you suspected. I found that the easiest way to bypass the problem on my end is to use the libgomp from the NVIDIA HPC SDK package while using all the other OpenACC libraries from PGI 19. In other words, just adding the following to my launch scripts has resolved the problem:

LD_PRELOAD=/common/pgi/Linux_x86_64/21.2/compilers/lib/libgomp.so

After some additional investigation, it seems the gomp dependency comes from a library that was recently added to our package, which explains why I never saw this behavior until now.

Thanks again for your help, I think I have all that I need to close this issue out for now.

-David