OpenMP: cuModuleGetGlobal returned error 500

I’ve compiled a C++ OpenMP offloading application (OpenMC) using nvc++ 21.9-0. However, when I attempt to run it on a V100 (driver 470.57.02, CUDA 11.4), I get the following error:

Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found

In searching around on these forums, I found some references to OpenACC codes hitting this issue when global variables are declared (e.g., here). Are global variables also not supported for OpenMP (declared as below)?

#pragma omp declare target
… // global variable
#pragma omp end declare target

I unfortunately don’t have a minimal reproducer, as OpenMC is quite large and the runtime error doesn’t provide much info about what the problem might be.

Thanks!

Hi John,

Global variables are supported, though perhaps one or more globals is missing from a “declare target” region?

The other place I’ve seen this issue is with shared objects. Does the code use shared objects?

While not all of our OpenMP and OpenACC runtime is shared, the data management is. Hence, you might try setting the environment variable “NVCOMPILER_ACC_DEBUG=1”. This will output quite a bit of info, but it might show which variable the runtime is trying to load. Look for the calls to “pgi_uacc_cuda_static” in the output.
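For example (a sketch; the “./openmc --event” launch line is taken from later in this thread, so adjust it for your build):

```shell
# Enable the runtime's data-management trace. The trace goes to stderr
# and is verbose, so capture it to a file for searching.
export NVCOMPILER_ACC_DEBUG=1
if [ -x ./openmc ]; then              # run the app if it's present
  ./openmc --event 2> acc_debug.log
  grep pgi_uacc_cuda_static acc_debug.log
fi
```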

-Mat

Thanks for the tip on the debug flag! Yes, it gives much more info on what might be causing the issue:

pgi_uacc_upstart( file=/home/<...>/openmc_offload/openmc/src/random_lcg.cpp, function=openmc_set_seed, line=121:124, line=124, devid=1 )
cuda_init_device thread:1 data.default_device_num:1 pdata.cuda.default_device_num:1
cuda_init_device(threadid=1, device 0) dindex=1, api_context=0x1d6c2e0
Loading 38968 bytes from pgi_cuda_loc
pgi_uacc_cuda_static hostptr=0x7f59f81b2010-0x7f59f81b2014 size=4 name=_ZN6openmc5model13root_universeE flags=0x400
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found

The root_universe variable is declared in a file “geometry.h” as:

namespace openmc {
namespace model {
#pragma omp declare target
extern int root_universe;  //!< Index of root universe
#pragma omp end declare target
...

and is defined in “geometry.cpp” as:

namespace openmc {
namespace model {
int root_universe {-1};

Yes, OpenMC gets compiled as a shared library and then is linked in when its main.cpp gets compiled at the end.

It’s possible I’m missing a global somewhere, but I have been able to compile and run cleanly with both LLVM/Clang and Intel.

The use of “declare target” looks to be ok, so the shared object may be the problem.

Features such as global variables require relocatable device code (RDC) and linking with the device linker. However, NVIDIA doesn’t have a dynamic device linker, so RDC couldn’t be used with shared objects. Not long ago we were able to create a method to enable RDC within shared objects: so long as the references are self-contained within the shared object itself, it works, at least for C and Fortran codes. For C++ there are still issues, which may be what you’re encountering. Granted, this was for OpenACC, but the same process is also used for OpenMP.

I have an open issue report on this (TPR #28720), but since our engineers are largely focused on adding features and hardening our general OpenMP offload support, they haven’t had time to investigate. If this is indeed your issue, I can ask if we can increase the priority.

But before doing that, how difficult would it be to change your build process to statically link the binary rather than use the shared object? I’d like to confirm that the shared object is indeed causing the issue before asking our team to divert from other high-priority tasks.

Thanks,
Mat

Thanks for opening the issue report – much appreciated!

Changing the build system ended up being very easy. When building with static instead of dynamic linking, the bug goes away, which seems to confirm that the shared object was indeed the cause of our problem. While this is a good workaround for now, we will need to compile as a shared object for multiphysics simulations (with Nek5000/NekRS) down the line.

Anyway, while static linkage gets around that bug, I ran into another one further along in the program’s run. With the original dynamic linkage, the code gave an error immediately at launch:

./openmc --event
call to cuModuleGetGlobal returned error 500: Not found

However, there is a different bug now with static linkage. The program runs through the initialization and file I/O routines cleanly, but when it reaches the first big kernel, it gives the following error:

./openmc --event
...
...
(OpenMC initialization, file I/O, etc)
...
...
Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)
Aborted

I tried setting the debug environment variable, but the messages didn’t point to anything obvious to me. Perhaps you can help me decode the debug messages? Below are the last 10-20 lines or so:

pgi_uacc_upstart( file=/home/<...>/openmc_offload/openmc/include/openmc/shared_array.h, function=_ZN6openmc27process_calculate_xs_eventsERNS_11SharedArrayINS_14EventQueueItemEEE, line=113:149, line=147, devid=1 )
pgi_uacc_dataupa(devptr=0x1,hostptr=0xffc3e0,stride=1,size=1,extent=-1,eltsize=24,lineno=-147,name=_in_44954,flags=0x40020400=copyin+dynamic+openmp,async=-1,threadid=1)
pgi_uacc_dataupx(devptr=0x7fcf27e04000,hostptr=0xffc3e0,stride=1,size=1,extent=-1,eltsize=24,lineno=147,name=_in_44954,async=-1,threadid=1)
pgi_uacc_cuda_dataup1(devdst=0x7fcf27e04000,hostsrc=0xffc3e0,offset=0,stride=1,size=1,eltsize=24,lineno=147,name=_in_44954,threadid=1)
pgi_uacc_updone( devid=1 )
pgi_uacc_cuda_wait(lineno=-99,async=-1,dindex=1,threadid=1)
pgi_uacc_cuda_wait(sync on stream=0x21267a0,threadid=1)
pgi_uacc_cuda_wait done (threadid=1)
pgi_uacc_upstart( file=/home/<...>/openmc_offload/openmc/include/openmc/shared_array.h, function=_ZN6openmc27process_calculate_xs_eventsERNS_11SharedArrayINS_14EventQueueItemEEE, line=113:149, line=147, devid=1 )
pgi_uacc_dataupa(devptr=0x1,hostptr=0xffc3c0,stride=1,size=1,extent=-1,eltsize=24,lineno=-147,name=_in_44966,flags=0x40020400=copyin+dynamic+openmp,async=-1,threadid=1)
pgi_uacc_dataupx(devptr=0x7fcf27e03f00,hostptr=0xffc3c0,stride=1,size=1,extent=-1,eltsize=24,lineno=147,name=_in_44966,async=-1,threadid=1)
pgi_uacc_cuda_dataup1(devdst=0x7fcf27e03f00,hostsrc=0xffc3c0,offset=0,stride=1,size=1,eltsize=24,lineno=147,name=_in_44966,threadid=1)
pgi_uacc_updone( devid=1 )
pgi_uacc_cuda_wait(lineno=-99,async=-1,dindex=1,threadid=1)
pgi_uacc_cuda_wait(sync on stream=0x21267a0,threadid=1)
pgi_uacc_cuda_wait done (threadid=1)
pgi_uacc_get_device_num(devtype=4,threadid=1)
pgi_uacc_get_device_num(devtype=4,threadid=1) cuda devid=1 dindex=1 devnum=0
Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)
Aborted

Or, if you have any other ideas on what might be causing this second issue, please let me know. Thanks for all your help!

Great, good to know. I’ll add your findings to the report and see if engineering can bump up the priority. Normally I’d have users build the shared object without RDC enabled (-gpu=nordc), but RDC is required for global variables since they need to be device-linked, so that isn’t an option for you.

Fatal error: expression ‘HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))’ (value 1) is not equal to expression ‘HX_SUCCESS’ (value 0)

Unfortunately, this is just a generic error indicating that some kernel launch failed. As I noted before, only the data-management portion of the runtime is shared between OpenACC and OpenMP “target teams distribute” constructs, so ACC_DEBUG won’t be able to show you which kernel is failing. If you switch to “target teams loop”, which shares more of the OpenACC code path, then it can; but due to how “distribute” needs to be implemented, we don’t have a way to display this information for it.

Instead, you’ll need to use cuda-gdb or cuda-memcheck to determine where the error is occurring.

-Mat

Thanks for the debugging tips – much appreciated!

With cuda-gdb I’m able to identify the line that’s causing trouble.

Additionally, I had some luck and was able to distill the bug into a fairly small reproducer (available here). The reproducer compiles and runs fine with:

  • LLVM/Clang on V100
  • AOMP on MI100
  • Intel on a Gen9

but gives this error when compiled and run with NVHPC (nvc++ 21.9-0, driver 470.57.02, CUDA 11.4) on a V100:

Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)
Aborted

Hi John,

I can’t say why it would work with other compilers, but I believe your code is in error. In the “copy_host_to_device” and “copy_device_to_host” methods, the code updates the “this” pointer itself. This causes a shallow update of the class data members, including the “data_” pointer; hence the code overwrites the device “data_” pointer with the host address. To fix it, update the data members separately. For example:

  void copy_host_to_device()
  {
//    #pragma omp target update to(this[:1])
    #pragma omp target update to(data_[:capacity_])
    #pragma omp target update to(capacity_)
    #pragma omp target update to(size_)
  }

With this change, the code runs successfully for me:

% make
nvc++ -Wall -fopenmp -mp=gpu -Minfo=mp  -c main.cpp -o main.o
main:
      1, include "global.h"
           2, include "shared_array.h"
               18, #omp target teams distribute parallel for
                   18, Generating Tesla and Multicore code
                       Generating "nvkernel_main_F3L18_1" GPU kernel
                   22, Loop parallelized across teams and threads(128), schedule(static)
SharedArray<long>::operator [](long):
      1, include "global.h"
           2, include "shared_array.h"
               62, Generating implicit omp declare target routine
                   Generating Tesla code
SharedArray<long>::size():
      1, include "global.h"
           2, include "shared_array.h"
              123, Generating implicit omp declare target routine
                   Generating Tesla code
SharedArray<long>::allocate_on_device():
      1, include "global.h"
           2, include "shared_array.h"
              158, Generating target enter data map(create: data_[:capacity_])
SharedArray<long>::copy_host_to_device():
      1, include "global.h"
           2, include "shared_array.h"
nvc++ -Wall -fopenmp -mp=gpu -Minfo=mp  -c global.cpp -o global.o
nvc++ -Wall -fopenmp -mp=gpu -Minfo=mp  main.o global.o -o test
% ./test
value = 0
value = 1
value = 2
value = 3
value = 4
value = 5
value = 6
value = 7
value = 8
value = 9

Hope this helps,
Mat


Hi Mat,

Yes, I can certainly see how the update to(this[:1]) could cause an issue. Your fix resolves the reproducer for me. However, I’m now running into another issue with OpenMC. I’ll try to distill it down to a reproducer as well, and will open another thread if I manage to isolate it.

Thanks so much for all your help!
