reproducer.tar.gz (1.4 KB)
Please consider this reproducer. I’m creating a small shared object from C code with some OpenACC pragmas, plus a main that calls it. Whatever I do, the resulting code only runs on the same GPU architecture as the compilation host.
The build system is CMake, and there’s a shell script to compile it that assumes nvc is in the path. I have tried this with nvhpc 24.5, 24.3, 24.1 and 23.5. My host system is Ubuntu 20.04 with cmake 3.29.6. I’ve put 3.24 in the required version, but am not sure that’s the true minimum.
In our actual use case we have a custom build system using Docker images for isolation, based on the nvcr.io/nvidia/nvhpc:24.5-devel-cuda12.4-ubuntu20.04 image. In CI this runs on an AWS g4dn instance. The attached archive is the simple reproducer.
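For orientation, the library and caller are roughly of this shape (a minimal sketch with hypothetical names, not the actual attached files):

/* library sketch -- a tiny OpenACC kernel exported from the shared object */
void vec_add(const float *x, float *y, int n)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += x[i];
}

/* caller sketch -- plain C main, dynamically linked against the .so */
#include <stdio.h>
extern void vec_add(const float *x, float *y, int n);

int main(void)
{
    float x[4] = {1.f, 2.f, 3.f, 4.f};
    float y[4] = {4.f, 3.f, 2.f, 1.f};
    vec_add(x, y, 4);
    printf("y[2] is: %g\n", y[2]);
    return 0;
}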
This compiles and runs fine. However, when I analyze the resulting shared object with cuobjdump -all, there is an elf section that only ever seems to have the compile host’s native compute arch embedded and no ptx code:
Fatbin elf code:
================
arch = sm_86
code version = [1,7]
host = linux
compile_size = 64bit
identifier = /tmp/pgcudafatnKDCNMmUzZv0.o
This seems to prevent the code from being executed on hosts with different GPUs. I have done a bunch of tests around this, specifying things like -gpu=ccall or -gpu=cc75, and the results always contain multiple “identifier = ” sections in this style:
- /tmp/pgcudafatnKDCNMmUzZv0.o or similar, only ever one compute arch, only elf
- …/…/src/cuda_fill.c with elf for sm_35 to sm_90 including an sm_90 ptx
- 11.8/cuda_assign.o with elf for sm_35 to sm_90 including an sm_90 ptx
- […]/wrong_compute_arch_reproducer/reproducer-lib.c with elf for what I specified and ptx for the newest specified
My use case is to create a general-purpose library in CI on a Jenkins server that has some arbitrary GPU. I want this to potentially run on a wide variety of hosts that have different GPUs than the Jenkins machine. This /tmp/pgcudafat*.o elf section is ruining my plan. What is this, why does it exist, and how do I get rid of it in a proper way?
The current workaround is also strange: if I compile this on a host that does NOT have an accessible GPU, or in a Docker container that is started without --gpus=all, then this pgcudafat elf section DOES contain the elf code for all desired compute archs. Still no ptx, though.
Does anyone know what’s going on here? How is this not a problem for lots of people? Am I doing anything weird that causes this?
Hi cassfalg,
Try disabling RDC via “-gpu=nordc”.
While we were able to add RDC support for C and Fortran, adding it for C++ proved challenging. Hence, since your main is in C++, RDC needs to be disabled.
The caveat is that some features, such as global device variables and calling device subroutines found in different source files, need the device link step and so are not supported without RDC.
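For example (a sketch with hypothetical file and function names, not code from your reproducer), a pattern like this needs the device link step and therefore RDC:

/* scale.c -- hypothetical second source file providing a device routine */
#pragma acc routine seq
double scale(double x)
{
    return 2.0 * x;
}

/* fill.c -- calls the device routine defined in scale.c; resolving that
   call on the device is what requires the device link step */
#pragma acc routine seq
extern double scale(double x);

void fill(double *a, int n)
{
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = scale(a[i]);
}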
-Mat
reproducer2.tar.gz (1.6 KB)
Hi Mat,
I’ll do one better: see reproducer2.tar.gz, using both -gpu=nordc and a pure C main. Same behaviour.
I’m also confused by your comment: My question is why a pure C shared object created with -gpu=ccall seems to have a code section that is only available for one single compute architecture and without ptx. That is a compile/link-time thing, right? How does a C or C++ main that dynamically links against this matter? That’s a runtime thing?
Regards,
Chris
Hi Chris,
It looks like you only added it to the compile? “nordc” needs to be added to both the compile and link of the shared object.
I also just noticed that you don’t add “-acc” when linking “reproducer-main”. If you add that, you might be able to then compile and link the SO with RDC. Though if you’re using another compiler to create the main, like g++, then stick with nordc.
My question is why a pure C shared object created with -gpu=ccall seems to have a code section that is only available for one single compute architecture and without ptx. That is a compile/link-time thing, right? How does a C or C++ main that dynamically links against this matter? That’s a runtime thing?
I don’t quite know all the details, but I believe it has to do with how the device code gets initialized. When you have “-acc” on the link for a main program, the compiler dynamically creates an object that does the device initialization, which is invoked when the binary is loaded (i.e. before “main” is called). With RDC, each target device has its own distinct device code generation. The initialization determines which device is on the system and sets which version to use.
Without RDC, the device initialization is delayed until the first device code is generated. The code is also generic (not device specific) and gets JIT compiled when the kernels are first launched. These are cached, so you only pay the JIT overhead once.
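As a side note, if you’d rather pay that initialization cost at a known point instead of on the first kernel launch, one option is to call the standard OpenACC runtime routine acc_init early on (a sketch only; whether this also triggers the JIT/module load is something I’d have to check):

/* warm_up.c -- sketch: explicitly initialize the NVIDIA device type
   via the OpenACC runtime API instead of relying on implicit init */
#include <openacc.h>

void warm_up_device(void)
{
    acc_init(acc_device_nvidia);
}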
-Mat
Is it this /tmp/pgaccel*.o that contains this initialization code? If so, there’s no ptx in sight.
In the end, I don’t think we need RDC, and putting nordc on both the compile and link line seems to completely remove the offending section. I will test this with our production setup. We do indeed need to link against the SOs with gcc.
Do you have some in depth articles, whitepapers or similar that explain the intricacies of RDC? This has cropped up more than once. I feel I need to understand this to not stumble around in the dark.
No, the object names would be “acc_init_link_acc.o”, “acc_init_link_host.o”, or “acc_init_link_cuda.o”. The combination used depends on what flags are being used. Also, initialization would be done from the host, so no PTX.
Do you have some in depth articles, whitepapers or similar that explain the intricacies of RDC?
RDC has been around for a long time (circa CUDA 5 in 2012) and I recall seeing these then, but can’t find them off-hand.
Best I can do is point you to our docs on JIT compilation and the NVCC docs on separate compilation.
What is this strange temporary object file then? Does this also occur on your end? Why does it not respect the flags for which compute archs to compile code for? Why is there no PTX when everything else has PTX?
The thing is… things like this make me lose trust in the OpenACC ecosystem. I have this weird error. It breaks my use case. It does not fit my understanding of how things work. It is weirdly quirky, what with the difference between having a GPU available or not at COMPILE time. It is contrary to the flags I use. I cannot find anything on it using Google. How am I supposed to work with this? I’ll never know when I’ll stumble into the next weird thing that keeps me busy for a week.
By the way, if your general recommendation seems to be to disable RDC, why is it on by default?
This was confusing me as well. I couldn’t reproduce the error at first because I missed that you’re building on a system without a GPU. After I tried that, I do see the error and think it’s a compiler bug.
We did have an issue with sm_89 where it was missing from the “ccall” flag, which in turn caused it to miss the cuda fill step (i.e. the code that registers device binaries), so no device code was generated. This is expected to be fixed in 24.7.
Your issue is different but may be related. “pgcudafat” objects are the fat binary objects in which the various device binary files are gathered. I didn’t think we’d used this method for quite some time, so my assumption is that when building for sm_89 we’re going down an old path. I’m also seeing some issues with other “cc” targets, so there may be a bigger general issue when creating SOs on systems without a GPU.
Let me talk with engineering to confirm, and then file a bug report.
For now, using “nordc” seems to work given it skips the compile time device code generation in favor of the runtime JIT compile.
By the way, if your general recommendation seems to be to disable RDC, why is it on by default?
In general we recommend using RDC. The only cases where RDC needs to be disabled are when building a shared object with C++ or when the SO is linked with a different compiler family such as GNU.
Since you’re linking the SO with C++, I initially just presumed that this was your issue. Though given the SO itself is C, I should have dug into it further. Apologies and thanks for keeping the pressure on.
gpu-cuobjdump.txt (4.7 KB)
no-gpu-cuobjdump.txt (5.8 KB)
I wouldn’t say that my issue only occurs when compiling without a GPU. It just makes a difference, and I didn’t really expect that. I’ve attached the output of cuobjdump -all build/libreproducer-lib.so for reference, both for a build with a GPU and one without. Both contain this /tmp/pgcudafat*.o section. It’s just that the one without a GPU has more compute archs in it, so it is more usable in practice.
This was done in a Docker image that is a slight extension of your nvcr.io/nvidia/nvhpc:24.5-devel-cuda_multi-ubuntu20.04 image. I’ve reduced the required CMake version to 3.14 and changed the compiler ID in the CMake script from NVHPC to PGI to account for the older CMake support scripts. The host here is Ubuntu 20.04 with driver version 555.42.06 from CUDA 12.5 in the current apt repos.
Does using cmake change the equation?
I’m still stumped on this one. I sent the reproducer to one of our compiler engineers for advice. However, the code just worked for him. I then tried it again on the same systems as yesterday, but then it worked for me. I spent about 3 hours this morning trying various things to get it to fail again, but no luck. Just keeps passing.
My objdumps look the same as yours, so I don’t know if that’s relevant or not.
Does using cmake change the equation?
Possibly, but I built with CMake as well as running the compile commands from a shell script and directly on the command line. When I did see the error, I was using a bash script.
My next step is to see whether it’s environmental or some flag combination. I have some high-priority items to work on, but will keep working on this as a background task.
When you say fail/pass, do you actually execute it on a different machine with a different GPU? I mostly check cuobjdump, since that’s faster and more convenient for me. I have machines with other compute archs, but they aren’t conveniently available. Just to make sure we’re talking about the same thing.
The good news, I suppose, is that disabling RDC works for us, so there’s really no time pressure. I’m looking forward to what you’ll come up with, though. I’m stumped as well.
Correct. I’m building on a system without a GPU and then running it on a system with an H100, which is where I saw the failure but can no longer reproduce it.
When I get back to it, I should also try an L4 or a 4080 so sm_89 is in the mix.
In my testing, though, building without a GPU gave the “best” results, as in the most SMs included in the SO. When built on a system with a GPU, the code only included the “native” SM code.
Initially our build on Jenkins had a Turing-based card (sm_75) and I couldn’t run it on our Ampere-based workstations (sm_86).
I was able to get back to this and, after starting over, was able to get it to consistently fail and then with one change, get it to consistently pass. Now given all the inconsistency I had earlier, I can’t be sure this will work for you, but let’s give it a try.
The problem seems to be that “-gpu=ccall” is missing from the link line for the shared object when building via your cmake setup. Hence the link step doesn’t know to generate the various different target binaries and only links for a single target.
Here’s the script I’m using to build
#[ 16%] Building C object CMakeFiles/reproducer-lib.dir/reproducer-lib.c.o
/proj/nv/Linux_x86_64/dev/compilers/bin/nvc -Dreproducer_lib_EXPORTS -O2 -gopt -fPIC -acc -Minfo=all -gpu=ccall,lineinfo -tp=x86-64-v3 -MD -MT CMakeFiles/reproducer-lib.dir/reproducer-lib.c.o -MF CMakeFiles/reproducer-lib.dir/reproducer-lib.c.o.d -o CMakeFiles/reproducer-lib.dir/reproducer-lib.c.o -c /home/mcolgrove/tmp/rep2/reproducer-lib.c
#[ 33%] Linking C shared library libreproducer-lib.so
rm libreproducer-lib.so
# Linking without ccall - Fails
/proj/nv/Linux_x86_64/dev/compilers/bin/nvc -fPIC -O2 -gopt -acc -static-nvidia -shared -Wl,-soname,libreproducer-lib.so -o libreproducer-lib.so "CMakeFiles/reproducer-lib.dir/reproducer-lib.c.o"
# Linking with ccall - Passes
#/proj/nv/Linux_x86_64/dev/compilers/bin/nvc -fPIC -O2 -gopt -acc -gpu=ccall -static-nvidia -shared -Wl,-soname,libreproducer-lib.so -o libreproducer-lib.so "CMakeFiles/reproducer-lib.dir/reproducer-lib.c.o"
#[ 50%] Building C object CMakeFiles/reproducer-main.dir/reproducer-main.c.o
/proj/nv/Linux_x86_64/dev/compilers/bin/nvc -O2 -gopt -MD -MT CMakeFiles/reproducer-main.dir/reproducer-main.c.o -MF CMakeFiles/reproducer-main.dir/reproducer-main.c.o.d -o CMakeFiles/reproducer-main.dir/reproducer-main.c.o -c /home/mcolgrove/tmp/rep2/reproducer-main.c
#[ 66%] Linking C executable reproducer-main
/usr/local/bin/cmake -E cmake_link_script CMakeFiles/reproducer-main.dir/link.txt --verbose=1
/proj/nv/Linux_x86_64/dev/compilers/bin/nvc -O2 -gopt "CMakeFiles/reproducer-main.dir/reproducer-main.c.o" -o reproducer-main -Wl,-rpath,/home/mcolgrove/tmp/rep2/build libreproducer-lib.so
#[ 66%] Built target reproducer-main
#[ 83%] Building CXX object CMakeFiles/reproducer-main-cpp.dir/reproducer-main.cpp.o
/proj/nv/Linux_x86_64/dev/compilers/bin/nvc++ -O2 -gopt -MD -MT CMakeFiles/reproducer-main-cpp.dir/reproducer-main.cpp.o -MF CMakeFiles/reproducer-main-cpp.dir/reproducer-main.cpp.o.d -o CMakeFiles/reproducer-main-cpp.dir/reproducer-main.cpp.o -c /home/mcolgrove/tmp/rep2/reproducer-main.cpp
#[100%] Linking CXX executable reproducer-main-cpp
/usr/local/bin/cmake -E cmake_link_script CMakeFiles/reproducer-main-cpp.dir/link.txt --verbose=1
/proj/nv/Linux_x86_64/dev/compilers/bin/nvc++ -O2 -gopt "CMakeFiles/reproducer-main-cpp.dir/reproducer-main.cpp.o" -o reproducer-main-cpp -Wl,-rpath,/home/mcolgrove/tmp/rep2/build libreproducer-lib.so
I’m building on a system with an RTX6000 (sm_89) and then running on a system with an H100 (sm_90).
Without “ccall”, here’s the error I’m seeing:
% ./reproducer-main-cpp
Failing in Thread:1
Accelerator Fatal Error: call to cuLinkComplete returned error 209 (CUDA_ERROR_NO_BINARY_FOR_GPU): No binary for GPU
File: /home/mcolgrove/tmp/rep2/reproducer-lib.c
Function: create:3
Line: 5
With “ccall” on the link, it runs fine:
% ./reproducer-main-cpp
x[2] is: 2, y[2] is: 8
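As an aside, a quick way to confirm which compute capability the runtime actually sees on a given machine (a small standalone check, not part of the reproducer; it assumes the CUDA runtime header and libcudart are available to compile and link against) is something like:

/* cc_check.c -- prints the compute capability of device 0, handy when
   chasing CUDA_ERROR_NO_BINARY_FOR_GPU on a new host */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("device 0: %s, compute capability sm_%d%d\n",
           prop.name, prop.major, prop.minor);
    return 0;
}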
I haven’t managed to execute it on a different host yet, but whether -gpu=ccall is on the link line (via adding it to the target_link_options calls) or not does change the elf sections, just as having a GPU during compilation or not does. Still no PTX section though - why is that? That means this is not forward compatible with newer GPUs, right?
Still no PTX section though - why is that?
With RDC, device binaries are created. If you want PTX, use “nordc”.
However, the default PTX target will be the latest device, currently cc90, or the device on the build system. What might be missing is adding the minimum target; for example, if the minimal target is a V100, then “-gpu=nordc,cc70”.
% cuobjdump -lptx libreproducer-lib.so
PTX file 1: reproducer-lib.sm_70.ptx
With nordc there is no /tmp/pgcudafat*.o, and there are elf and ptx sections as I would expect.
But with RDC there are elf and ptx sections for everything except the /tmp/pgcudafat*.o part, which makes the other ptx sections useless, doesn’t it? Since there is one section without ptx, there is no point in having forward-compatible ptx sections for the other parts. This is what caused the issue in the first place, I think. If I look at cuobjdump -all, most code sections in the resulting SO look fine (as in they contain ptx and multiple sm elf sections), except this one section. I’m assuming that an appropriate section is needed for all parts of the SO, at least?
I’m still not sure how things really work, I have to admit.
If I use these settings:
target_compile_options(OpenACC::OpenACC_C
    INTERFACE -Minfo=all # -Mreentrant
              -gpu=ccall,lineinfo
              -tp=x86-64-v3)
target_link_options(OpenACC::OpenACC_C INTERFACE -static-nvidia -gpu=ccall)
then I have this:
$ cuobjdump -lelf build/libreproducer-lib.so
ELF file 1: pgcudafatjIohBkyp9F4u.sm_50.cubin
ELF file 2: pgcudafatjIohBkyp9F4u.sm_60.cubin
ELF file 3: pgcudafatjIohBkyp9F4u.sm_61.cubin
ELF file 4: pgcudafatjIohBkyp9F4u.sm_70.cubin
ELF file 5: pgcudafatjIohBkyp9F4u.sm_75.cubin
ELF file 6: pgcudafatjIohBkyp9F4u.sm_80.cubin
ELF file 7: pgcudafatjIohBkyp9F4u.sm_86.cubin
ELF file 8: pgcudafatjIohBkyp9F4u.sm_89.cubin
ELF file 9: pgcudafatjIohBkyp9F4u.sm_90.cubin
ELF file 10: cuda_fill.sm_35.cubin
ELF file 11: cuda_fill.sm_50.cubin
ELF file 12: cuda_fill.sm_60.cubin
ELF file 13: cuda_fill.sm_61.cubin
ELF file 14: cuda_fill.sm_70.cubin
ELF file 15: cuda_fill.sm_75.cubin
ELF file 16: cuda_fill.sm_80.cubin
ELF file 17: cuda_fill.sm_86.cubin
ELF file 18: cuda_fill.sm_89.cubin
ELF file 19: cuda_fill.sm_90.cubin
ELF file 20: cuda_assign.sm_35.cubin
ELF file 21: cuda_assign.sm_50.cubin
ELF file 22: cuda_assign.sm_60.cubin
ELF file 23: cuda_assign.sm_61.cubin
ELF file 24: cuda_assign.sm_70.cubin
ELF file 25: cuda_assign.sm_75.cubin
ELF file 26: cuda_assign.sm_80.cubin
ELF file 27: cuda_assign.sm_86.cubin
ELF file 28: cuda_assign.sm_89.cubin
ELF file 29: cuda_assign.sm_90.cubin
$ cuobjdump -lptx build/libreproducer-lib.so
PTX file 1: cuda_fill.sm_90.ptx
PTX file 2: cuda_assign.sm_90.ptx
But that listing is missing all of the sections for my own code, which only show up with -all:
$ cuobjdump -all build/libreproducer-lib.so
#[...] sections corresponding to the -lptx and -lelf output above
Fatbin elf code:
================
arch = sm_50
code version = [1,7]
host = linux
compile_size = 64bit
compressed
identifier = /home/[...]/reproducer-lib.c
#[...] other sm's
Fatbin elf code:
================
arch = sm_90
code version = [1,7]
host = linux
compile_size = 64bit
compressed
identifier = /home/[...]/reproducer-lib.c
Fatbin ptx code:
================
arch = sm_90
code version = [8,4]
host = linux
compile_size = 64bit
compressed
identifier = /home/[...]/reproducer-lib.c
ptxasOptions =
Not sure if that is relevant, but it is something related that I also do not understand, and it is the reason why I use -all all the time.
I believe the PTX sections you’re seeing here are coming from the runtime libraries. If you take off “-static-nvidia”, they go away, at least they do for me:
% cuobjdump -lptx libreproducer-lib.so
cuobjdump info : No PTX file found to extract from '/home/mcolgrove/tmp/reproducer/build/libreproducer-lib.so'. You may try with -all option.
Now with “cuobjdump -all”, I do see the one Fatbin ptx code:
Fatbin ptx code:
================
arch = sm_90
code version = [8,4]
host = linux
compile_size = 64bit
compressed
identifier = /home/mcolgrove/tmp/reproducer/reproducer-lib.c
ptxasOptions =
I don’t know enough about cuobjdump to say why this doesn’t show up under the “-lptx” flag. I suspect it gets embedded in the fat binaries to handle forward compatibility, but would need to check with engineering to confirm.