I am trying to do some GPU offloading on Perlmutter for a C++ project, but I can't actually get my code to run on a GPU. Using the OpenMP target directive on my main loop, the code compiles and runs, but profiling with NVIDIA Nsight shows it never actually executes on the GPU.

I have a stripped-down version of the code with the same data structures and loop as the real thing, and that version offloads to the GPU just fine (again confirmed with Nsight). Even when I give the stripped-down version a much larger problem than the real code, it runs on the GPU while the real code runs only on the CPU.

The only indication that something is going wrong in the real code is an error that appears very inconsistently. The one consistent rule I've found is that it shows up when running the real code in a Perlmutter interactive job without going through nsys. No matter what I have tried so far, I cannot reproduce the error in the stripped-down version, even with the exact same loop, loop contents, and OpenMP directives. I have also tried many different combinations of OpenMP directives, all with the same result.
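In case it helps narrow things down, a check like the following (a minimal sketch using only standard OpenMP runtime calls) should report at runtime whether a region actually lands on the device, independently of Nsight:

#include <cstdio>
#include <omp.h>

int main()
{
    // How many offload devices does the OpenMP runtime see?
    std::printf("num_devices = %d\n", omp_get_num_devices());

    int on_host = 1;
    #pragma omp target map(tofrom: on_host)
    {
        // omp_is_initial_device() returns nonzero when the region
        // is actually executing on the host instead of the device.
        on_host = omp_is_initial_device();
    }
    std::printf("target region ran on the %s\n", on_host ? "HOST" : "DEVICE");
    return 0;
}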
I am compiling with:
nvc++ -std=c++2a -O2 -Mlarge_arrays -mp=gpu -gpu=cc80 -Minfo=mp,accel -Minline -o program.x program.cpp
I am running the code with:
srun -n 1 -N 1 -G 1 -c 1 --cpu-bind=cores numactl --interleave=all nsys profile -o prof ./program.x 2>sim_err.log 1>sim_out.log
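In case it's relevant: my understanding of OpenMP 5.0 is that setting

export OMP_TARGET_OFFLOAD=MANDATORY

before the run should make a failed offload abort instead of silently falling back to the host, so I can run with that set and post the output if it would help.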
The error is:
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found
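My best guess about the error (and it is only a guess): cuModuleGetGlobal is the CUDA driver call that looks up a named global symbol in the compiled device image, and error 500 is CUDA_ERROR_NOT_FOUND, so it looks like the runtime is asking for a symbol that never made it into the device code. If that reading is right, a file-scope variable or function used inside the target region might be missing a declare target directive. A sketch of what I mean (Ha is one of the variables from my shared clause below; I'm assuming here that it's file-scope, and the value is a placeholder):

// Without declare target, a file-scope variable referenced inside a
// target region may have no device copy, and the runtime's symbol
// lookup for it can then fail at launch time.
#pragma omp declare target
double Ha = 1.0e-19;   // placeholder value for illustration
#pragma omp end declare target

Does that sound like a plausible reading of the error?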
The desired loop/OpenMP directive structure is:
#pragma omp target teams distribute parallel for \
    reduction(+:PE) default(none) private(A,B,pc) \
    shared(acc,aacc,world_rank,world_size,Ha,write_step,lllen,R,pos,vel,m,w,u_r,u_s,moi,kin,kout,distances,h_min,dt) \
    defaultmap(none) \
    map(tofrom:acc[0:num_particles],aacc[0:num_particles],PE) \
    map(to:write_step,vel[0:num_particles],m[0:num_particles],moi[0:num_particles],pos[0:num_particles],distances[0:num_pairs])
for (pc = 1; pc <= num_pairs; pc++)
{
//Too much loop code to add here
}
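For reference, the stripped-down version boils down to roughly the following (a sketch: the loop and directive structure match, but the arrays are trimmed down and the loop body is a placeholder, not the real physics). This is representative of the version that does offload correctly:

#include <cstdio>
#include <vector>

int main()
{
    const int num_particles = 1024;
    const int num_pairs = num_particles * (num_particles - 1) / 2;

    // Same data layout as the real code: flat arrays, one pair loop.
    std::vector<double> pos(3 * num_particles, 1.0);
    std::vector<double> acc(3 * num_particles, 0.0);
    std::vector<double> distances(num_pairs, 1.0);

    double PE = 0.0;
    double* pos_p = pos.data();
    double* acc_p = acc.data();
    double* dist_p = distances.data();
    int pc;

    #pragma omp target teams distribute parallel for \
        reduction(+:PE) private(pc) \
        map(tofrom: acc_p[0:3*num_particles], PE) \
        map(to: pos_p[0:3*num_particles], dist_p[0:num_pairs])
    for (pc = 1; pc <= num_pairs; pc++)
    {
        // Placeholder body: touch the mapped arrays and the
        // reduction variable the way the real pair loop does.
        const int i = pc - 1;
        const int p = i % num_particles;
        PE += 0.5 * dist_p[i];
        #pragma omp atomic
        acc_p[3 * p] += 1.0e-6 * pos_p[3 * p];
    }

    std::printf("PE = %f\n", PE);
    return 0;
}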
The code runs just fine with CPU threads only (so everything is initialized, at least on the host). Any idea what this error could mean, or things to try to force the code onto the GPU? Apologies in advance if this is the wrong place for this question; let me know and I'm happy to move it. I'm also happy to provide more information if needed. Thanks in advance!