Missing CUDA device code when trying to link nvc object file with gcc

I am trying to create a static lib (an archive file in principle, though just a .o in this example) on Ubuntu 20.04 with nvc 21.9 and link it into a final executable with gcc. The code links and runs when I link with nvc, or when I build a shared object instead, but not with the plain object file: gcc links it (after adding a bunch of libraries), but it fails at runtime:

$ ./foo-main 
Accelerator Fatal Error: No CUDA device code available
 File: /home/uie55546/git/ti-cuda-examples/src/scratch/foo.c
 Function: process:4
 Line: 4

Is it possible to create a static archive containing device code and link it with gcc?

I found a lot of older information on this topic, mostly about -ta=nordc issues with older versions of the PGI compilers, which should be fixed in the HPC SDK compilers. I am hoping that there is updated documentation somewhere for how to do this that I simply haven’t found yet, and I would appreciate some pointers.

Details of what I did:

//foo.h
typedef struct points {
    float* x; float* y;
    int n;
} points;
void process(points point);

//foo.c
#include "foo.h"

void process(points p) {
    #pragma acc parallel loop copy(p, p.x[:p.n]) copyin(p.y[:p.n])
    for (int i=0; i<p.n; ++i ) p.x[i] += p.y[i];
}

//foo-main.c
#include <stdlib.h>
#include <stdio.h>
#include "foo.h"

int main() {
    points p;
    p.n = 1000;
    p.x = ( float*) malloc ( sizeof ( float )*p.n );
    p.y = ( float*) malloc ( sizeof ( float )*p.n );
    process(p);
    printf("all done, exiting\n");
}

Object file:

$ nvc -fPIC -Minfo=accel -O3 -acc -c -o foo.o foo.c 
process:
      4, Generating copy(p) [if not already present]
         Generating copyin(p.y[:p.n]) [if not already present]
         Generating copy(p.x[:p.n]) [if not already present]
         Generating Tesla code
          6, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

$ gcc -std=c11 foo-main.c -o foo-main foo.o -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/targets/x86_64-linux/lib -Wl,-rpath /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib/acc_init_link_cuda.o /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib/acc_init_link_host.o /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib/acc_init_link_acc.o -lacchost -laccdevaux -laccdevice -ldl -lcudadevice -latomic -lnvhpcatm -lstdc++ -lnvomp -lnvc -lnvcpumath -lm -lcudadevrt -lcudart_static -lrt -lpthread

$ ./foo-main 
Accelerator Fatal Error: No CUDA device code available
 File: /home/uie55546/git/ti-cuda-examples/src/scratch/foo.c
 Function: process:4
 Line: 4

Shared library:

$ nvc -fPIC -Minfo=accel -O3 -acc -shared -o foo.so foo.c 
process:
      4, Generating copy(p) [if not already present]
         Generating copyin(p.y[:p.n]) [if not already present]
         Generating copy(p.x[:p.n]) [if not already present]
         Generating Tesla code
          6, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

$ gcc -std=c11 foo-main.c -o foo-main foo.so -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/targets/x86_64-linux/lib -Wl,-rpath /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib/acc_init_link_cuda.o /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib/acc_init_link_host.o /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib/acc_init_link_acc.o -lacchost -laccdevaux -laccdevice -ldl -lcudadevice -latomic -lnvhpcatm -lstdc++ -lnvomp -lnvc -lnvcpumath -lm -lcudadevrt -lcudart_static -lrt -lpthread

$ LD_LIBRARY_PATH="." ./foo-main 
all done, exiting

Linking with nvc:

$ nvc -acc foo-main.c -o foo-main foo.o
foo-main.c:

$ ./foo-main 
all done, exiting

For reference, I used the -dryrun option for nvc to get the required libs for the gcc command line.
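
In case it is useful, a minimal sketch of that step (same file names as above); -dryrun prints the compile and link commands without executing them, including the full library list I copied from:

$ nvc -acc -dryrun foo-main.c foo.o -o foo-main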

Hi Cassfalg,

Yes, but with limitations.

“RDC” stands for “Relocatable Device Code” and allows for cross-file device function calls and access to global device variables. It is enabled by default for OpenACC in order to support features such as the “declare create” and “routine” directives. However, RDC-enabled code requires linking the device code with NVIDIA’s device linker, nvlink, which ‘gcc’ does not invoke; hence your error.
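
For illustration, here is a hypothetical two-file sketch (made-up file names) of the kind of cross-file device call that RDC makes possible:

// helper.c (hypothetical): device-callable routine in its own file
#pragma acc routine seq
float scale(float v) { return 2.0f * v; }

// caller.c (hypothetical): the call to scale() inside the parallel loop
// is a cross-file device call; nvlink must resolve the device symbol,
// which is why RDC is enabled by default
#pragma acc routine seq
extern float scale(float v);

void run(float *x, int n) {
    #pragma acc parallel loop copy(x[:n])
    for (int i = 0; i < n; ++i) x[i] = scale(x[i]);
}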

The easiest solution would be to use ‘nvc’ to perform the final link. The nvlink step will then be applied, and you won’t need to add all the dependent compiler runtime libraries to the link line.

Otherwise, you’ll need to disable RDC via the “-gpu=nordc” flag (formerly -ta=tesla:nordc) when compiling the OpenACC code. You won’t be able to use some features, such as the cross-file “routine” calls and “declare create” globals mentioned above, but for this simple example it should be fine.
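
A sketch of that build for your example (the gcc link line stays exactly as the one you already assembled):

$ nvc -fPIC -Minfo=accel -O3 -acc -gpu=nordc -c -o foo.o foo.c
$ gcc -std=c11 foo-main.c -o foo-main foo.o <same -L/-rpath/-l flags as above>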

Note that for shared objects, we are able to perform the nvlink step to resolve device references within the shared object itself, though global device variables and cross-file device routine calls need to be self-contained within the library.

Hope this helps,
Mat

We have one case where we have a larger application with a complex build system that we don’t want to touch too much. Switching the compiler is something we would like to avoid. Currently, the parts that we want to accelerate are in .a libs, hence my question. Switching to shared objects is relatively straightforward in the build system though, so we’re leaning towards that.

If I understand shared objects correctly, it is possible to call functions containing accelerated regions from other shared objects? That is just normal host-side linking and calling, and the device part has already been handled within each shared object in that case?

Let’s say I want to compile an executable with gcc that uses two shared objects, A and B, both compiled with nvc/nvc++ and containing OpenACC code. Is it possible to have functions in A allocate data and put it on the device via unstructured data regions, and then process that data in B with present clauses, without copying it back and forth in between? No global variables would be used, only pointers or references that the executable code gets from A and passes to B.

And really, feel free to point me to documentation, or even books, with the details. I feel like this should probably be documented somewhere and I’m just not looking in the right places? Or is this so bleeding edge that the documentation has not been written yet?

Yes, that’s fine. The host’s dynamic linker will resolve these calls on load. It’s only device kernels calling device functions outside of their own shared object that’s not available, since there currently isn’t a dynamic linker for device code.
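
To make the limitation concrete, a hypothetical sketch of the one pattern that is not supported:

// libA.c, built into A.so: the device routine lives here
#pragma acc routine seq
float scale(float v) { return 2.0f * v; }

// libB.c, built into B.so
#pragma acc routine seq
extern float scale(float v);

void run(float *x, int n) {
    #pragma acc parallel loop copy(x[:n])
    for (int i = 0; i < n; ++i)
        x[i] = scale(x[i]);   // NOT supported: a device-side call into
                              // A.so; no dynamic linker exists for
                              // device code to resolve it at load time
}

A plain host-side call from B.so into A.so, outside of any compute region, is fine.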

Note that we don’t have C++ RDC support in shared objects working yet, so stick with using C (or Fortran) for now. Though if you need to use C++, then compile with “-gpu=nordc”.
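
So a C++ library under that restriction would be built along these lines (a sketch, assuming a foo.cpp with equivalent contents):

% nvc++ -fPIC -Minfo=accel -O3 -acc -gpu=nordc -shared -o foo.so foo.cpp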

Is it possible to have functions in A allocate data and put it on the device via unstructured data regions, and then process that data in B with present clauses, without copying it back and forth in between? No global variables would be used, only pointers or references that the executable code gets from A and passes to B.

I haven’t tried this use case myself, so I’m not 100% confident, but I believe it should be fine. I extended your example by adding “initp” and “updatep” routines in a separate SO, which do an unstructured data copy/update of “p”, then changed “process” to use the “present” clause. It worked fine. Though, let me know if you encounter issues in the larger app.

Or is this so bleeding edge that the documentation has not been written yet?

Well, we document OpenACC usage, and separately how to create shared objects. Since they pretty much “just work” together, there’s not really any need for separate documentation.

Note that there’s no need to add all the dependent runtime libs on the gcc link line, though you will need to set your LD_LIBRARY_PATH to the lib directory of the runtime libraries so the loader can find them. Also, look for the “REDIST” directory in your install. These libraries can be redistributed with your executable if you are shipping your application to a third party…
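
For example, in csh (the REDIST location shown is an assumption; check your install):

% setenv LD_LIBRARY_PATH /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib:$LD_LIBRARY_PATH
% ls /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/REDIST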

Here’s the modified version of your example extended to use two OpenACC enabled SOs:

% cat foo.h
//foo.h
typedef struct points {
    float* x; float* y;
    int n;
} points;
void process(points point);
void initp(points point);
void updatep(points point);

% cat foo.c
//foo.c
#include "foo.h"

void process(points p) {
    #pragma acc parallel loop present(p)
    for (int i=0; i<p.n; ++i ) p.x[i] += p.y[i];
}
% cat initp.c
//initp.c
#include "foo.h"

void initp(points p) {
    for (int i=0; i<p.n; ++i ) {
        p.x[i] = 1.;
        p.y[i] = 2.;
    }
#pragma acc enter data copyin(p)
#pragma acc enter data copyin(p.x[:p.n])
#pragma acc enter data copyin(p.y[:p.n])
}

void updatep(points p) {
#pragma acc update self(p.x[:p.n])
}
% cat foo-main.c
//foo-main.c
#include <stdlib.h>
#include <stdio.h>
#include "foo.h"

int main() {
    points p;
    p.n = 1000;
    p.x = ( float*) malloc ( sizeof ( float )*p.n );
    p.y = ( float*) malloc ( sizeof ( float )*p.n );
    initp(p);
    process(p);
    updatep(p);
    printf("p.x[1] = %f\n",p.x[1]);
    printf("all done, exiting\n");
}
% nvc -fPIC -Minfo=accel -O3 -acc -shared -o initp.so initp.c
initp:
     12, Generating enter data copyin(p,p.y[:p.n],p.x[:p.n])
updatep:
     16, Generating update self(p.x[:p.n])
% nvc -fPIC -Minfo=accel -O3 -acc -shared -o foo.so foo.c
process:
      4, Generating present(p)
         Generating NVIDIA GPU code
          6, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
      6, Accelerator restriction: size of the GPU copy of p.x,p.y is unknown
% gcc -std=c11 foo-main.c -o foo-main initp.so foo.so
% setenv LD_LIBRARY_PATH <cwd>:$LD_LIBRARY_PATH
% setenv NV_ACC_TIME 1  # I set our internal profiler to confirm that the kernel and data regions are called
% ./foo-main
p.x[1] = 3.000000
all done, exiting

Accelerator Kernel Timing data
initp.c
  initp  NVIDIA  devicenum=0
    time(us): 51
    12: data region reached 3 times
        12: data copyin transfers: 5
             device time(us): total=51 max=19 min=8 avg=10
/local/home/mcolgrove/initp.c
  updatep  NVIDIA  devicenum=0
    time(us): 20
    16: update directive reached 1 time
        16: data copyout transfers: 1
             device time(us): total=20 max=20 min=20 avg=20
foo.c
  process  NVIDIA  devicenum=0
    time(us): 7
    4: compute region reached 1 time
        4: kernel launched 1 time
            grid: [1024]  block: [128]
             device time(us): total=7 max=7 min=7 avg=7
            elapsed time(us): total=288 max=288 min=288 avg=288
    4: data region reached 2 times