I’m creating a shared library from C code that has a number of OpenACC pragmas, using NVHPC SDK 24.3. These are my compile options:
-O2 -gopt -c99 -fPIC -acc -Minfo=accel -gpu=cuda11.8,ccall,lineinfo -tp=x86-64-v3
Now, in my understanding -acc defaults to -acc=gpu, so it should only generate GPU code, right? It looks like it contains CPU code as well.
The C code that gets compiled into this lib contains something like this, so it implicitly prints some statistics on what’s going on when the lib gets called:
#ifdef _OPENACC
printf("attempting to dump present table\n");
acc_present_dump_all();
printf("kernels since start: %u\n", acc_kernels());
#endif
When I link against this with g++ I’m getting a binary that prints present tables and shows a positive number for kernels since start. It is also pretty quick. Using nsys profile on it shows data transfers, kernel calls and so on as expected.
When I use nvc++ instead, without -acc options, I get a binary that consistently shows “kernels since start: 0”, dumps no present table and is noticeably slower – as if it were running a CPU version of the code. Same result with “nvc++ -acc=host”, “nvc++ -acc=multicore” or “nvc++ -acc=host,multicore”. Only if I use nvc++ -acc, or any combination that contains gpu, do I get the GPU version. nsys profile likewise shows GPU activity only when -acc contains gpu or has no sub-options, and reports “does not contain […]” otherwise.
Now… what is going on? Why does my shared lib contain both CPU and GPU code? Can I force this with -acc=gpu,multicore? How and where is it chosen which version runs? If I were to run the gcc-linked binary on a host without a GPU, would it fall back to the CPU code? That would actually be nice in our case, though it did confuse me to no end initially. Any insight would be appreciated. Unfortunately I haven’t managed to boil this down into a minimal reproducer, or I would share it…
OpenACC can be put into shared objects when building with C (nvc) or Fortran (nvfortran). However, we don’t support device linking of shared objects when using C++ (nvc++). So for this case, you may need to add “-gpu=nordc” so relocatable device code is not used and the device link step is skipped.
I’m unclear on one point: you say this is C code, but it appears you are using nvc++? If so, you can also try using nvc instead of nvc++.
Likely what’s happening here is that since no device linking is done, at runtime it can’t detect that the device is available, so it uses the host fallback. When you link with “nvc++ -acc” the device link does occur, which is why it works. g++ doesn’t know how to do device linking.
The shared object is pure C compiled with nvc. The main program is C++, which is why g++/nvc++ is involved. It calls a C function as an entry point in the shared object, where all GPU-related things happen. There are no calls from the main program into kernels or similar; in fact there aren’t even any kernels in the main program. I’ve put some debugging statements in it to try things, but they are unrelated to the OpenACC code in the shared object. Things like this:
#include <cstdio>
#ifdef _OPENACC
#include <openacc.h>
#endif

int main() {
    char * buf = new char[200];
#ifdef _OPENACC
    printf("one\n");
    acc_present_dump_all();
    #pragma acc enter data create(buf[0:200])
    printf("two\n");
    acc_present_dump_all();
    #pragma acc exit data delete(buf[0:200])
    printf("three\n");
    acc_present_dump_all();
#endif
    delete[] buf;
    return 0;
}
Otherwise there’s no OpenACC pragmas or runtime calls in the main program. A device link step shouldn’t be necessary?
I’m still not sure why there is cpu code in the lib when compiling with just -acc. The reference manual at least states:
2.2.2. -[no]acc
Enable [disable] OpenACC directives. The following suboptions may be used following
an equals sign (“=”), with multiple sub-options separated by commas:
gpu
(default) OpenACC directives are compiled for GPU execution only.
> Otherwise there’s no OpenACC pragmas or runtime calls in the main program. A device link step shouldn’t be necessary?
Device linking is required when RDC is used for either the main program or the shared object.
> The shared object is pure C compiled with nvc.
RDC is expected to work with C or Fortran shared objects, but you can try adding “-gpu=nordc” to see if it works around the issue.
> I’m still not sure why there is cpu code in the lib when compiling with just -acc. The reference manual at least states:
What I’d expect to happen is that you’d get a runtime failure saying that no device is found for this target, but given that you’re not seeing this error, I’m guessing it’s falling back to the host code. Then again, RDC should work with C shared objects, so I may not be diagnosing the issue correctly.
Try the “-gpu=nordc” flag. If that doesn’t work, we can work through other options and, if possible, create a reproducing example so I can investigate.