I’m trying to migrate an OpenMP codebase to OpenMP target offloading, and naturally compiling it with nvc++ for the first time as well.
I’ve now managed to build my code with OpenMP completely turned off. However, if I turn it on, no matter whether it’s on GPU or CPU and seemingly independent of optimization settings, I end up with output similar to the following:
/domus/h1/nettel/nvhpc/Linux_x86_64/21.3/compilers/share/llvm/bin/llc
/scratch/nvc++g6igsB3eWw9b.ll -march=x86-64 -mcpu=native -O0 -fast-isel=0 -non-global-value-max-name-size=4294967295 -x86-cmov-converter=0 --frame-pointer=none -o
/scratch/nvc++26igI1YMkJdo.s
/domus/h1/nettel/nvhpc/Linux_x86_64/21.3/compilers/share/llvm/bin/llc: error:
/domus/h1/nettel/nvhpc/Linux_x86_64/21.3/compilers/share/llvm/bin/llc:
/scratch/nvc++g6igsB3eWw9b.ll:9237:22: error: use of undefined value '%.F0063.addr'
%3 = load i32, i32* %.F0063.addr, align 4, !dbg !5681
^
nvc++-Fatal-llc completed with exit code 1
Unlinking /scratch/nvc++M6igYCJYIcXM.il
Unlinking /scratch/nvc++w6igcd-rfoMK.s
Unlinking /scratch/nvc++g6igsB3eWw9b.ll
Unlinking /scratch/nvc++26igI1YMkJdo.s
Unlinking /scratch/nvc++M6igYHjWaNhU.llvm
I’ve not found any easy flag to make it keep temporary files at that pass. -Mkeepasm is not really appropriate for this.
Again, I get this independently of what -mp target I choose (gpu or multicore). Any hints on how to troubleshoot this, or what would be needed to report it as a bug?
This looks like a code generation bug. Are you able to provide a reproducing example that exhibits the error? If it's small, you can post it inline or provide a link. If the code is not something you want to post publicly, feel free to direct message me and we can arrange a way for you to send us the example.
I realized that an array reduction could be the culprit, and indeed I get it to compile if I drop the reduction there. From that, I realized that I had a macro problem, so I was in fact using the old openmp pragma from my GCC build, rather than an attempt to use proper OpenMP target pragmas. Hence, I guess the bug might not be very relevant.
Anyway, when I tried to reduce it to a sharable minimal case, I started getting a different error, so it’s not a perfect repro, but I get an internal compiler error instead. If it’s still relevant for you and that’s not enough, we’ll have to find a way to share the full code privately.
(The full code does not look completely like this, if it did I realize I could do the reduction in a number of more reasonable ways.)
bash-4.2$ /home/nettel/nvhpc/Linux_x86_64/2021/compilers/bin/nvc++ repro.cpp -mp=gpu -gpu=cc75
NVC++-S-0053-Illegal use of void type (repro.cpp: 21)
NVC++-S-0053-Illegal use of void type (repro.cpp: 21)
NVC++-F-0000-Internal compiler error. ll_abi_complete_arg_info: void function argument 0 (repro.cpp: 21)
NVC++/x86-64 Linux 21.3-0: compilation aborted
repro.cpp (752 Bytes)
Thanks Carl. We have a few known issues with array reduction support in OpenMP (we just added it recently), including this one. We have a fix being tested in our development compiler which, assuming testing goes well, will be available in a future release.
Thanks for the information. I managed to work around the problem somewhat by using a Thrust reduce by key for that specific loop.
However, I can also note that trying to do the same thing with just atomics sometimes gave errors, but minor rearrangements fixed those. (Obviously, performance is not good when you hammer the same addresses with atomics.)
I finally just wanted to try another naïve approach.
#pragma omp target teams distribute
for (int i = 0; i < num_haps; i++)
{
    #pragma omp parallel for reduction(+ : sums0[i])
    for (int j = 0; j < num_haps; j++)
    {
        sums0[i] += table[i * num_haps + j];
    }
    #pragma omp parallel for reduction(+ : sums1[i])
    for (int j = 0; j < num_haps; j++)
    {
        sums1[i] += table[j * num_haps + i];
    }
}
Even with some slight variations, this just resulted in:
nvc++-Fatal-/domus/h1/nettel/nvhpc/Linux_x86_64/21.3/compilers/bin/tools/cpp1 TERMINATED by signal 11
I have another issue with a lambda capture as well now, but I will try to rule out that it somehow references a stack-allocated pointer before posting.
I already got a 5x speedup with the Thrust + OpenMP Target approach against my CPU version with a T4 (and arguably very old CPU cores), so using nvc++ has been a blast, when cheating using managed memory allocation… From the docs I gather this is really only supported for OpenACC.
21.5 should be out soon, so you can try again once it's available. Our OpenMP Target Offload support is very new, so we're still working through issues. I don't know if 21.5 will fix your issue, but there are a lot of fixes going in, so it may.
Though, you can also try using the OpenMP “loop” construct instead:
#pragma omp target teams loop
for (int i = 0; i < num_haps; i++)
{
    #pragma omp loop reduction(+ : sums0[i])
    for (int j = 0; j < num_haps; j++)
    {
        sums0[i] += table[i * num_haps + j];
    }
    #pragma omp loop reduction(+ : sums1[i])
    for (int j = 0; j < num_haps; j++)
    {
        sums1[i] += table[j * num_haps + i];
    }
}
FYI, we’ve also started supporting C++ standard language parallelism (built on top of Thrust) if you want to give it a try. See: https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
when cheating using managed memory allocation… From the docs I gather this is really only supported for OpenACC.
No, it’s supported for OpenMP as well, and the default for C++ stdpar. It’s one of the reasons why we added the “-gpu” flag (as opposed to the older OpenACC “-ta” flag), so we could put the common GPU options for all models under a single flag. Since OpenMP target offload and stdpar are so new, our docs are just a bit behind.
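To illustrate what relying on managed memory with OpenMP target offload can look like, here is a minimal sketch. The function `scaled_sum` is hypothetical, and the build line in the comment (in particular the `managed` sub-option of `-gpu`) is an assumption that should be checked against the documentation for your compiler version.

```cpp
// Hypothetical sketch; assumes a build along the lines of
//   nvc++ -mp=gpu -gpu=managed example.cpp
// so that heap allocations are placed in CUDA managed memory.
double scaled_sum(int n)
{
    // Plain heap allocation: no map clause for 'data' appears below,
    // which only works on the GPU if the allocation is managed.
    double* data = new double[n];
    for (int i = 0; i < n; i++)
    {
        data[i] = 1.0;
    }

    double total = 0.0;
    #pragma omp target teams distribute parallel for \
        reduction(+ : total) map(tofrom: total)
    for (int i = 0; i < n; i++)
    {
        data[i] *= 2.0;    // touches the heap array directly on the device
        total += data[i];
    }

    delete[] data;
    return total;
}
```

Without managed memory, the array would need an explicit `map(tofrom: data[0:n])` clause on the target construct.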