I’m creating a shared library from C code that has a number of OpenACC pragmas, using NVHPC SDK 24.3. These are my compile options:
-O2 -gopt -c99 -fPIC -acc -Minfo=accel -gpu=cuda11.8,ccall,lineinfo -tp=x86-64-v3
Now, in my understanding -acc defaults to -acc=gpu, so it should only generate GPU code, right? It looks like it contains CPU code as well.
The C code that gets compiled into this lib contains something like this, so it implicitly prints some statistics on what’s going on when the lib gets called:
#ifdef _OPENACC
printf("attempting to dump present table\n");
acc_present_dump_all();
printf("kernels since start: %u\n", acc_kernels());
#endif
When I link against this with g++ I’m getting a binary that prints present tables and shows a positive number for kernels since start. It is also pretty quick. Using nsys profile on it shows data transfers, kernel calls and so on as expected.
When I use nvc++ instead, without -acc options, I get a binary that consistently shows “kernels since start: 0”, dumps no present table and is noticeably slower – as if it were running a CPU version of the code. Same result with “nvc++ -acc=host”, “nvc++ -acc=multicore” or “nvc++ -acc=host,multicore”. Only if I use nvc++ -acc, or any combination that contains gpu, do I get the GPU version. nsys profile likewise shows GPU activity only when -acc contains gpu or has no sub-options, and reports “does not contain […]” otherwise.
Now… what is going on? Why does my shared lib contain both CPU and GPU code? Can I force this with -acc=gpu,multicore? How and where is it chosen which version runs? If I were to run the gcc-linked binary on a host without a GPU, would it fall back to the CPU code? That would actually be nice in our case, though it did confuse me to no end initially. Any insight would be appreciated. Unfortunately I haven’t managed to boil this down into a minimal reproducer, or I would share it…
OpenACC can be put into shared objects when building with C (nvc) or Fortran (nvfortran). However, we don’t support device linking of shared objects when using C++ (nvc++). So for this case, you may need to add “-gpu=nordc” so relocatable device code is not used and the device link step is skipped.
I’m unclear on one point: you say this is C code, but it appears you are using nvc++? If so, you can also try using nvc instead of nvc++.
Likely what’s happening here is that since no device linking is done, at runtime it can’t detect that the device is available, so it uses the host fallback. When you link with “nvc++ -acc” the device link does occur, which is why it works. g++ doesn’t know how to do device linking.
The shared object is pure C compiled with nvc. The main program is C++, which is why g++/nvc++ is involved. It calls a C function as an entry point in the shared object, where all GPU-related things happen. There are no calls from the main program into kernels or similar; in fact there aren’t even any kernels in the main program. I’ve put some debugging statements in it to try things, but they are unrelated to the OpenACC code in the shared object. Things like this:
#include <cstdio>
#ifdef _OPENACC
#include <openacc.h>
#endif

int main() {
    char * buf = new char[200];
#ifdef _OPENACC
    printf("one\n");
    acc_present_dump_all();
    #pragma acc enter data create(buf[0:200])
    printf("two\n");
    acc_present_dump_all();
    #pragma acc exit data delete(buf[0:200])
    printf("three\n");
    acc_present_dump_all();
#endif
    delete[] buf;
    return 0;
}
Otherwise there’s no OpenACC pragmas or runtime calls in the main program. A device link step shouldn’t be necessary?
I’m still not sure why there is cpu code in the lib when compiling with just -acc. The reference manual at least states:
2.2.2. -[no]acc
Enable [disable] OpenACC directives. The following suboptions may be used following
an equals sign (“=”), with multiple sub-options separated by commas:
gpu
(default) OpenACC directives are compiled for GPU execution only.
> Otherwise there’s no OpenACC pragmas or runtime calls in the main program. A device link step shouldn’t be necessary?
Device linking is required when RDC is used for either the main program or the shared object.
> The shared object is pure C compiled with nvc.
RDC is expected to work with C or Fortran shared objects, but you can try adding “-gpu=nordc” to see if it works around the issue.
> I’m still not sure why there is cpu code in the lib when compiling with just -acc. The reference manual at least states:
What I’d expect to happen is that you’d get a runtime failure saying that no device is found for this target, but given that you’re not seeing this error, I’m guessing it’s falling back to the host code. Then again, RDC should work with C shared objects, so I may not be diagnosing the issue correctly.
Try the “-gpu=nordc” flag. If that doesn’t work, we can work through other options and, if possible, create a reproducing example so I can investigate.