Issue with locally defined classes in OpenMP offload region (since NVHPC 22.5)

Hello!

We have the following code pattern in DSL-generated code of the NEURON simulator:

struct cacum_Instance  {
    double* cai;
};

void nrn_state_cacum(cacum_Instance* inst, int start, int end, int gpu) {
    #pragma omp target teams distribute parallel for if(gpu)
    #pragma acc parallel loop if(gpu)
    for (int id = start; id < end; id++) {
        struct functor {
            cacum_Instance* inst;
            functor(cacum_Instance* inst) : inst{inst} {}
        };
        inst->cai[id] += 1.0;
    }
}

Up to NVHPC 22.3, the OpenMP offload version was compiling and working fine. With 22.5 and 22.7 we now get the following error:

$ nvc++  -g  -O2  --c++17  -c cacumm_prep.cpp -mp=gpu -Minfo=accel
"cacumm_prep.cpp", line 11: internal error: assertion failed: lower_expr: bad kind (lower_il.cpp, line 17583 in lower_expr_full)

              functor(cacum_Instance* inst) : inst{inst} {}
                                                         ^

1 catastrophic error detected in the compilation of "cacumm_prep.cpp".
Compilation aborted.
nvc++-Fatal-/gpfs/bbp.cscs.ch/ssd/apps/bsd/pulls/1654/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.7-ltthct/Linux_x86_64/22.7/compilers/bin/tools/cpp1 TERMINATED by signal 6
Arguments to /gpfs/bbp.cscs.ch/ssd/apps/bsd/pulls/1654/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.7-ltthct/Linux_x86_64/22.7/compilers/bin/tools/cpp1
/gpfs/bbp.cscs.ch/ssd/apps/bsd/pulls/1654/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.7-ltthct/Linux_x86_64/22.7/compilers/bin/tools/cpp1 --llalign -Dunix -D__unix -D__unix__ -Dlinux -D__linux -D__linux__ -D__NO_MATH_INLINES -D__LP64__ -D__x86_64 -D__x86_64__ -D__LONG_MAX__=9223372036854775807L '-D__SIZE_TYPE__=unsigned long int' '-D__PTRDIFF_TYPE__=long int' -D__amd64 -D__amd64__ -D__k8 -D__k8__ -D__MMX__ -D__SSE__ -D__SSE2__ -D__SSE3__ -D__SSSE3__ -D__ABM__ -D__SSE4_1__ -D__SSE4_2__ -D__AVX__ -D__AVX2__ -D__AVX512F__ -D__AVX512CD__ -D__AVX512VL__ -D__AVX512BW__ -D__AVX512DQ__ -D__F16C__ -D__FMA__ -D__XSAVE__ -D__XSAVEOPT__ -D__XSAVEC__ -D__XSAVES__ -D__POPCNT__ -D__AES__ -D__PCLMUL__ -D__CLFLUSHOPT__ -D__FSGSBASE__ -D__RDRND__ -D__BMI__ -D__BMI2__ -D__LZCNT__ -D__FXSR__ -D__RTM__ -D__PKU__ -D__PGI -D__NVCOMPILER -D_GNU_SOURCE -D_PGCG_SOURCE --c++17 -I- -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/include/python3.9 --sys_include /gpfs/bbp.cscs.ch/ssd/apps/bsd/pulls/1654/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.7-ltthct/Linux_x86_64/22.7/compilers/include --sys_include /gpfs/bbp.cscs.ch/ssd/apps/bsd/pulls/1654/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.7-ltthct/Linux_x86_64/22.7/cuda/11.7/include --sys_include /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-ehzq5x/include/c++/11.2.0 --sys_include /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-ehzq5x/include/c++/11.2.0/x86_64-pc-linux-gnu --sys_include /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-ehzq5x/include/c++/11.2.0/backward --sys_include /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-ehzq5x/lib/gcc/x86_64-pc-linux-gnu/11.2.0/include --sys_include /usr/local/include --sys_include 
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-ehzq5x/include --sys_include /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-ehzq5x/lib/gcc/x86_64-pc-linux-gnu/11.2.0/include-fixed --sys_include /usr/include -D__PGLLVM__ -D__NVCOMPILER_LLVM__ -D__extension__= -D_OPENMP=202011 -DCUDA_VERSION=11070 -DPGI_TESLA_TARGET -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -D_PGI_HX --preinclude _cplus_preinclude.h --preinclude_macros _cplus_macros.h --gnu_version=110200 -D__pgnu_vsn=110200 -g --dwarf2 --target_gpu --mp -D_OPENMP=202011 -D_NVHPC_RDC -q -o /tmp/nvc++z7iclJMg-z3g.il cacumm_prep.cpp

If I remove the if(gpu) condition on the offload loop, or move the class definition outside the loop, then the code compiles fine.
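For reference, here is a minimal sketch of the second workaround (hoisting the local class out of the loop body); the helper function is only for illustration and is not part of the original reproducer:

```cpp
#include <cstdio>

struct cacum_Instance {
    double* cai;
};

// Workaround sketch: the local class is hoisted out of the offload loop
// body, which avoids the front-end assertion while keeping if(gpu).
struct functor {
    cacum_Instance* inst;
    functor(cacum_Instance* inst) : inst{inst} {}
};

void nrn_state_cacum(cacum_Instance* inst, int start, int end, int gpu) {
    #pragma omp target teams distribute parallel for if(gpu)
    for (int id = start; id < end; id++) {
        inst->cai[id] += 1.0;
    }
}

// Hypothetical helper to exercise the host path (gpu = 0).
double first_element_after_update() {
    double data[4] = {0.0, 0.0, 0.0, 0.0};
    cacum_Instance inst{data};
    nrn_state_cacum(&inst, 0, 4, /*gpu=*/0);
    return data[0];
}
```

The downside for generated code is that the class name now lives at an enclosing scope, which may require the code generator to make the names unique per kernel.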

Could you please take a look?

NOTE: The OpenACC version compiles fine without any changes. So the issue seems to be specific to the latest NVHPC releases (at least 22.5 & 22.7) and OpenMP offload.


Thanks for the report, Pramod. I was able to reproduce the regression here and have filed a problem report, TPR #32434. No idea what change in the front-end C++ compiler caused this, but we'll have engineering investigate.

-Mat

Thank you for confirming the issue, Mat!

I assume this also appears with the upcoming 22.9 release?

If the engineering team suggests any workaround, we will certainly be interested!

Yes, it’s still in 22.9 as well.

Do you need to replicate the case where the offload is conditional at execution?

I was thinking that you could use a metadirective with a user condition, but this triggers a parse error when the functor is included. The syntax of the functor's constructor is similar to a metadirective clause, so my guess is that it somehow parses incorrectly. Though I've filed TPR #32437 and will let the compiler engineers determine the true cause.

My only other thought is to duplicate the loop in an if statement, putting the target teams pragma without the "if(gpu)" clause in the true branch. Not ideal, but hopefully a short-term workaround.
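A sketch of what that loop duplication could look like, assuming the DSL generator can emit both branches (the helper function below is only there to exercise the host path and is not part of the original code):

```cpp
struct cacum_Instance {
    double* cai;
};

void nrn_state_cacum(cacum_Instance* inst, int start, int end, int gpu) {
    if (gpu) {
        // Offload unconditionally: with no if() clause, the locally
        // defined class no longer trips the front-end assertion.
        #pragma omp target teams distribute parallel for
        for (int id = start; id < end; id++) {
            struct functor {
                cacum_Instance* inst;
                functor(cacum_Instance* inst) : inst{inst} {}
            };
            inst->cai[id] += 1.0;
        }
    } else {
        // Plain host loop for CPU execution.
        for (int id = start; id < end; id++) {
            inst->cai[id] += 1.0;
        }
    }
}

// Hypothetical helper exercising the host branch (gpu = 0).
double host_path_result() {
    double data[4] = {0.0, 0.0, 0.0, 0.0};
    cacum_Instance inst{data};
    nrn_state_cacum(&inst, 0, 4, /*gpu=*/0);
    return data[0];
}
```

The obvious cost is that the loop body is emitted twice, which for generated code mostly means a larger output file rather than extra maintenance burden.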

FYI, here’s what the code would look like using a metadirective (assuming the functor line didn’t cause an error):

#pragma omp metadirective \
        when( user={condition(gpu)}: target teams distribute parallel for)  \
        default( parallel for )
    for (int id = start; id < end; id++) {
        struct functor {
            cacum_Instance* inst;
#ifndef ERROR
            functor(cacum_Instance* inst) : inst{inst} {}
#endif
        };
        inst->cai[id] += 1.0;
    }
}

Do you need to replicate the case where the offload is conditional at execution?

Yes, I would, because we would like to use the same binary for CPU or GPU execution. Of course we could build separate modules/binaries, but we have been trying to keep a single binary, which helps with binary distribution.

My only other thought is to duplicate the loop in a if statement, putting the target teams pragma without the “if(gpu)” in the true condition section. Not ideal, but hopefully a short-term work around.

Ok, thanks! We will most likely try this as a temporary alternative.

FYI, here’s what the code would look like using a metadirective (assuming the functor line didn’t cause an error):

I hadn't used metadirectives until now. Thanks for the example!