OpenMP offloading with nvc++ compiler won't run on GPU

I am trying to do some GPU offloading on Perlmutter for a C++ project, but I can’t actually get my code to run on a GPU. Using the OpenMP target directive on my main loop, I can get the code to compile and run. However, NVIDIA Nsight Systems shows that the code is not actually running on the GPU.

I have a stripped-down version of the code with the same data structures and loop as the real thing. The stripped-down version offloads to the GPU just fine, confirmed with Nsight Systems. Even when I give the stripped-down version a much larger problem than the real code, it runs on the GPU while the real code runs only on the CPU.

The only indication that something is going wrong with the real code is an error that pops up very inconsistently. The only consistent pattern I’ve found is that it appears when I run the real code in a Perlmutter interactive job without running it through nsys. No matter what I have tried so far, I cannot reproduce the error in the stripped-down version, even with the exact same loop, loop contents, and OpenMP directives. I have also tried many different combinations of OpenMP directives, all with the same result.

I am compiling with:
nvc++ -std=c++2a -O2 -Mlarge_arrays -mp=gpu -gpu=cc80 -Minfo=mp,accel -Minline -o program.x program.cpp

I am running the code with:
srun -n 1 -N 1 -G 1 -c 1 --cpu-bind=cores numactl --interleave=all nsys profile -o prof ./program.x 2>sim_err.log 1>sim_out.log

The error is:
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found

The desired loop/openmp directive structure is:

#pragma omp target teams distribute parallel for reduction(+:PE) default(none) private(A,B,pc) shared(acc,aacc,world_rank,world_size,Ha,write_step,lllen,R,pos,vel,m,w,u_r,u_s,moi,kin,kout,distances,h_min,dt) defaultmap(none) map(tofrom:acc[0:num_particles],aacc[0:num_particles],PE) map(to:write_step,vel[0:num_particles],m[0:num_particles],moi[0:num_particles],pos[0:num_particles],distances[0:num_pairs])
        for (pc = 1; pc <= num_pairs; pc++)
        {
                  //Too much loop code to add here
        }

The code runs just fine with CPU threads only (everything is initialized, at least on the host). Any idea what this error could mean or things to try to force the code onto the GPU? Apologies in advance if this is the wrong place for this. Let me know and I’m happy to move it. I’m also happy to provide other information if needed. Thanks in advance!
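For reference, a minimal sketch of a check that does not depend on the profiler: it asks the OpenMP runtime how many devices it sees and whether a trivial target region lands on the device or falls back to the host. The function names `visible_devices` and `target_region_runs_on_device` are hypothetical and not part of the real code; the `_OPENMP` guards are only there so the sketch also builds without OpenMP enabled.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of offload devices the OpenMP runtime can see (0 without OpenMP).
int visible_devices() {
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

// True only if a trivial target region really executed on a device
// instead of silently falling back to the host.
bool target_region_runs_on_device() {
    int on_host = 1;  // assume host until the target region says otherwise
#ifdef _OPENMP
    #pragma omp target map(tofrom: on_host)
    {
        // omp_is_initial_device() returns nonzero when running on the host.
        on_host = omp_is_initial_device();
    }
#endif
    return on_host == 0;
}
```

Calling both functions at the start of `main` in the real code and in the stripped-down version, and printing the results, should show whether the two builds differ before the main loop is even reached.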

Hi kolanzi,

call to cuModuleGetGlobal returned error 500: Not found

This means at runtime there’s a missing global variable.

In the full code, are you using this in a shared object library? Relocatable device code (RDC) is needed to link global references, but RDC isn’t supported in C++ SOs which in turn can cause this error.

Another possibility is that you’re missing an “omp declare target” region around a global variable, so the device copy of the global isn’t getting created. The reference could be coming from an STL routine, a template, or some hidden reference that you didn’t add directly.
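As a rough illustration of the “omp declare target” case, here is a minimal sketch. The global `softening` and the helper `pair_energy` are made up for the example and are not taken from your code; the loop only mimics the shape of your reduction over pairs.

```cpp
// Everything between the declare target markers gets a device copy,
// so the offloaded loop can reference it at runtime.
#pragma omp declare target
// Hypothetical global read inside the offloaded loop.
double softening = 0.5;

// Hypothetical helper called from the offloaded loop.
double pair_energy(double r) {
    return 1.0 / (r + softening);
}
#pragma omp end declare target

// Sums the pair energies with a reduction, offloading when built with -mp=gpu.
double total_energy(const double* distances, int num_pairs) {
    double PE = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:PE) \
        map(to: distances[0:num_pairs]) map(tofrom: PE)
    for (int pc = 0; pc < num_pairs; pc++) {
        PE += pair_energy(distances[pc]);
    }
    return PE;
}
```

If the host changes a global like this after startup, the device copy also needs refreshing, for example with `#pragma omp target update to(softening)` before the loop. Without the declare target region, a global referenced from device code may have no device symbol to look up, which is the kind of situation that can surface as the cuModuleGetGlobal error.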

Is your project available so that I can try to reproduce the problem here? If not, can you post the body of the offloaded loop? It might give some clues as to what’s going on.

-Mat