Compiler failed to translate accelerator region (see -Minfo messages): Unexpected address of constant

wyphan · March 30, 2021, 4:55pm

Hi,

I’m testing code from a collaborator that implements Kronecker products as OpenACC and OpenMP target offload kernels:

I’m specifically testing the OpenACC version. When I compile it with NVHPC 21.2 using the provided build_pgi.sh script, nvfortran emits the error message in the title while compiling the double complex version of the code:

NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected address of constant (kron_mod.f90: 103)

Furthermore, when trying to inline the code for the OpenACC GEMM implementation, the compiler emits the following error message:

NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol - ..inline (kron_mod.f90: 103)

Notice that it’s the exact same line in kron_mod:

!$acc  kernels present(A1,X,Y)

The full output of the compile script can be found here.

Any help would be appreciated.

MatColgrove · March 30, 2021, 9:23pm

Hi wyphan,

I’m not too sure what’s wrong but it does appear to be a compiler issue when handling the double complex arrays. The “-DUSE_DOUBLE” case compiles correctly.

I’ve added an issue report, TPR #29839, and sent it to engineering for further investigation.

Is this a new project that you’re working on with Ed, or has this code worked in the past? I ask because, as far as I can tell, the compiler has never been able to compile the double complex version. I checked as far back as the 15.x compilers and see the same issue.

Thanks,
Mat

wyphan · March 30, 2021, 9:28pm

Hi Mat,

I believe this is a new code that Ed just recently ported from C++.

We’re looking into implementing FFT in OpenACC using Kronecker products. (That’s the reason for the Matlab files in the repo, which are a proof of concept of the algorithm.) We’re doing this because the FFT grid dimensions are too small (3-D complex-to-complex, at only ~50x50x50) to justify calling cuFFT, so we decided to implement a ZGEMM-based algorithm instead.

Thanks,
Wil
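
For readers unfamiliar with why a Kronecker product reduces to GEMM calls, the following is a minimal sketch of the standard identity (A2 ⊗ A1) vec(X) = vec(A1 X A2^T), with vec() stacking columns as Fortran does. The program name, sizes, and fill values are made up for illustration, and the index convention used in the actual repo may differ.

```fortran
program kron_identity_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Illustrative sizes only; the thread's grids are around 50x50x50.
  integer, parameter :: m1 = 2, n1 = 3, m2 = 2, n2 = 2
  complex(dp) :: a1(m1,n1), a2(m2,n2), x(n1,n2)
  complex(dp) :: k(m1*m2, n1*n2), y_kron(m1*m2), y_gemm(m1,m2)
  integer :: i1, j1, i2, j2

  ! Fill the operands with arbitrary deterministic values.
  do j1 = 1, n1
    do i1 = 1, m1
      a1(i1,j1) = cmplx(i1 + j1, i1 - j1, kind=dp)
    end do
  end do
  do j2 = 1, n2
    do i2 = 1, m2
      a2(i2,j2) = cmplx(i2 * j2, j2, kind=dp)
    end do
  end do
  do j2 = 1, n2
    do j1 = 1, n1
      x(j1,j2) = cmplx(j1, -j2, kind=dp)
    end do
  end do

  ! Explicit Kronecker product K = A2 (x) A1:
  ! block (i2,j2) of K is A2(i2,j2) * A1.
  do j2 = 1, n2
    do i2 = 1, m2
      k((i2-1)*m1+1 : i2*m1, (j2-1)*n1+1 : j2*n1) = a2(i2,j2) * a1
    end do
  end do

  ! Route 1: form K and multiply it by vec(X) (column-major stacking).
  y_kron = matmul(k, reshape(x, [n1*n2]))

  ! Route 2: two small matrix products, never forming K.
  y_gemm = matmul(matmul(a1, x), transpose(a2))

  ! The two routes should agree up to rounding.
  print *, 'max difference:', maxval(abs(y_kron - reshape(y_gemm, [m1*m2])))
end program kron_identity_sketch
```

Route 2 is the reason a GEMM kernel is enough: the large matrix K is never stored, and each factor is applied as an ordinary matrix product.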

bleback · March 30, 2021, 9:38pm

Hi Wil (and Ed),

This issue came to us through another channel yesterday. One work-around that I found for the first issue is to change the alpha and beta scalars to an “alphabeta” array of length 2. Mat has opened a bug for the issue. The original code should work, but it seems we are having trouble with complex*16 scalars that we believe to be constants.

Still looking at the inlining issue, but Fortran inlining shouldn’t be necessary here if the routines are in the same file. CUDA will inline them.

Brent
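
A minimal sketch of the work-around Brent describes, assuming a GEMM-like device routine. The names zgemm_sketch and alphabeta_sketch are hypothetical stand-ins, not the repo's code, and the sketch has not been checked against the affected compiler versions. Compiled without OpenACC, the directives are plain comments.

```fortran
module gemm_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical stand-in for the thread's GEMM routine:
  ! C = ab(1)*A*B + ab(2)*C, with alpha and beta packed into ab(1:2).
  subroutine zgemm_sketch(m, n, k, ab, a, b, c)
    !$acc routine seq
    integer, intent(in) :: m, n, k
    complex(dp), intent(in) :: ab(2)
    complex(dp), intent(in) :: a(m,k), b(k,n)
    complex(dp), intent(inout) :: c(m,n)
    complex(dp) :: acc
    integer :: i, j, p
    do j = 1, n
      do i = 1, m
        acc = (0.0_dp, 0.0_dp)
        do p = 1, k
          acc = acc + a(i,p) * b(p,j)
        end do
        c(i,j) = ab(1) * acc + ab(2) * c(i,j)
      end do
    end do
  end subroutine zgemm_sketch
end module gemm_sketch

program alphabeta_sketch
  use gemm_sketch
  implicit none
  integer :: n
  complex(dp) :: a(2,2), b(2,2), c(2,2), alphabeta(2)

  n = 2
  a = (1.0_dp, 0.0_dp)
  b = (0.0_dp, 1.0_dp)
  c = (0.0_dp, 0.0_dp)

  ! Instead of passing complex constants as separate alpha and beta
  ! arguments, store them in a length-2 array that is copied to the
  ! device like any other data.
  alphabeta(1) = (1.0_dp, 0.0_dp)   ! alpha
  alphabeta(2) = (0.0_dp, 0.0_dp)   ! beta

  !$acc serial copyin(alphabeta, a, b) copy(c)
  call zgemm_sketch(n, n, n, alphabeta, a, b, c)
  !$acc end serial

  print *, c
end program alphabeta_sketch
```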

wyphan · March 30, 2021, 10:02pm

Hi Brent,

I can confirm that the workaround works, but I think we’ve run into another issue here with parallelization. Here’s the -Minfo output for kronmult2:

kronmult2:
    158, Generating copyin(a1(:,:)) [if not already present]
         Generating create(w(:,:)) [if not already present]
         Generating copyin(nn) [if not already present]
         Generating copyout(y(:,:)) [if not already present]
         Generating copyin(x(:,:),ld3,mm,alphabeta(:),kk,ld1,ld2,a2(:,:)) [if not already present]
    160, Generating present(x(:,:),w(:,:),a1(:,:))
    162, Loop carried dependence due to exposed use of alphabeta(:) prevents parallelization
         Inner sequential loop scheduled on host
         Accelerator serial kernel generated
         Generating Tesla code
    163, Reference argument passing prevents parallelization: ld2
         Reference argument passing prevents parallelization: ld1
         Reference argument passing prevents parallelization: kk
         Reference argument passing prevents parallelization: nn
         Reference argument passing prevents parallelization: mm
         Reference argument passing prevents parallelization: ld3

In the code I’m working on that will have code parts from this repo, the GEMM subroutine (called ZGEMM_acc there) resides in a different module, which is why we’re looking into inlining.

Thanks,
Wil

bleback · March 30, 2021, 10:31pm

Dang, this shouldn’t be that hard. For the integer arguments passed by reference, I thought we would be okay since the arguments are marked as intent(in). I think that used to be enough, but I will need to go back and check whether that was actually the case. In any event, you can make that work (and get better GPU performance) if you pass those integer variables by value in device code. Then it will parallelize.

Also, if you mark the loop in kronmult2 as “independent” we will parallelize that. Or, use ACC parallel instead of kernels.

That still leaves the inlining problems. They seem to be the result of using reshape when inlining, but you need reshape because of how the A matrix is passed. Let’s try to get the non-inlined cases working first :)
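
The two suggestions above (integer arguments passed by value, and the loop marked independent) can be sketched as follows. The routine names zgemv_sketch and apply_columns are hypothetical and much simpler than the repo's kronmult2. This is an illustration of the idea under those assumptions, not the actual fix applied to the code.

```fortran
module kronmult_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical device routine: y = ab(1)*A*x + ab(2)*y for one column.
  ! The sizes carry the VALUE attribute, so they are passed by value
  ! rather than by reference.
  subroutine zgemv_sketch(m, k, ab, a, x, y)
    !$acc routine seq
    integer, value :: m, k
    complex(dp), intent(in) :: ab(2), a(m,k), x(k)
    complex(dp), intent(inout) :: y(m)
    complex(dp) :: acc
    integer :: i, p
    do i = 1, m
      acc = (0.0_dp, 0.0_dp)
      do p = 1, k
        acc = acc + a(i,p) * x(p)
      end do
      y(i) = ab(1) * acc + ab(2) * y(i)
    end do
  end subroutine zgemv_sketch

  subroutine apply_columns(m, k, n, ab, a, x, y)
    integer, intent(in) :: m, k, n
    complex(dp), intent(in) :: ab(2), a(m,k), x(k,n)
    complex(dp), intent(inout) :: y(m,n)
    integer :: j

    !$acc kernels copyin(ab, a, x) copy(y)
    ! Each iteration writes only column j of y, so the iterations do not
    ! conflict and asserting independence is safe.
    !$acc loop independent
    do j = 1, n
      call zgemv_sketch(m, k, ab, a, x(:,j), y(:,j))
    end do
    !$acc end kernels
  end subroutine apply_columns
end module kronmult_sketch

program independent_sketch
  use kronmult_sketch
  implicit none
  complex(dp) :: ab(2), a(2,3), x(3,4), y(2,4)

  ab = [(1.0_dp, 0.0_dp), (0.0_dp, 0.0_dp)]
  a = (1.0_dp, 0.0_dp)
  x = (0.0_dp, 1.0_dp)
  y = (0.0_dp, 0.0_dp)

  call apply_columns(2, 3, 4, ab, a, x, y)
  print *, y(:,1)
end program independent_sketch
```

The VALUE attribute requires an explicit interface, which the module provides here.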

wyphan · March 30, 2021, 10:39pm

Hi Brent,

Yes, I can confirm that after adding the independent clause, the compiler seems happy with parallelizing that loop.

On switching kernels to parallel: is this a case where the compiler’s implementation of parallel is “better” or more stable than its implementation of kernels, as in GCC?

Ed, I think I need your opinion here on kernels vs parallel.

Thanks,
Wil

bleback · March 30, 2021, 11:06pm

One is not better than the other, but they have differences. As you’ve seen, kernels puts more freedom and responsibility on the compiler, and it may even choose to run a portion on the host if it cannot determine that it is safe to run it in parallel on the GPU. Parallel is more of an assertion to “Run this in parallel”, and the independent clause means pretty much the same thing: it is safe to run this in parallel. Good job actually checking the -Minfo output. Sometimes things like this can go undetected.
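
To make the difference concrete, here is a small sketch (not taken from the repo; the program and variable names are invented) of a loop whose independence the compiler cannot prove on its own, because the writes go through an index array. Under kernels the programmer supplies the assertion with a loop directive, while parallel loop carries the assertion itself.

```fortran
program kernels_vs_parallel_sketch
  implicit none
  integer, parameter :: n = 8
  integer :: idx(n), i
  real :: src(n), dst1(n), dst2(n)

  ! idx is a permutation (a reversal), so no two iterations write the
  ! same element. The compiler cannot know this from the loop alone.
  do i = 1, n
    idx(i) = n + 1 - i
    src(i) = real(i)
  end do
  dst1 = 0.0
  dst2 = 0.0

  ! kernels: the compiler has to establish independence. The loop
  ! directive below supplies that assertion on the programmer's behalf.
  !$acc kernels copyin(idx, src) copyout(dst1)
  !$acc loop independent
  do i = 1, n
    dst1(idx(i)) = 2.0 * src(i)
  end do
  !$acc end kernels

  ! parallel loop: the directive itself asserts that the iterations
  ! may run in parallel.
  !$acc parallel loop copyin(idx, src) copyout(dst2)
  do i = 1, n
    dst2(idx(i)) = 2.0 * src(i)
  end do

  print *, 'results agree:', all(dst1 == dst2)
end program kernels_vs_parallel_sketch
```

In both forms the correctness of the assertion is the programmer's responsibility: if idx contained duplicates, the result would depend on execution order.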

dazevedoef · March 31, 2021, 2:06pm

Would you kindly also try the OpenMP target offload directives (something like “-DOMP_TARGET -U_OPENACC -mp=gpu”) with the NVIDIA compiler? I think the OpenMP target offload directives work with the IBM XL compiler. Thank you very much.

bleback · June 3, 2021, 5:58pm

The original issue here, “Unexpected address of constant” error, has been addressed in the NVIDIA HPC SDK version 21.5.

wyphan · June 3, 2021, 7:06pm

Awesome, now we just need to wait until NVHPC 21.5 or newer is installed on Summit (hopefully soon, especially since there’s a major maintenance next week). Right now only NVHPC 21.3 is available there, so I can’t test yet…
