Compiler failed to translate accelerator region (see -Minfo messages): Unexpected address of constant

Hi,

I’m testing code from a collaborator that implements Kronecker products as OpenACC and OpenMP target offload kernels:

I’m specifically testing the OpenACC version. When I compile it with NVHPC 21.2 using the provided build_pgi.sh script, nvfortran emits the error message quoted in the title while compiling the double complex version of the code.

NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected address of constant (kron_mod.f90: 103)

Furthermore, when trying to inline the code for the OpenACC GEMM implementation, the compiler emits the following error message:

NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol - ..inline (kron_mod.f90: 103)

Notice that it’s the exact same line in kron_mod:

!$acc  kernels present(A1,X,Y)

The full output of the compile script can be found here.

Any help would be appreciated.

Hi wyphan,

I’m not too sure what’s wrong but it does appear to be a compiler issue when handling the double complex arrays. The “-DUSE_DOUBLE” case compiles correctly.

I’ve added an issue report, TPR #29839, and sent it to engineering for further investigation.

Is this a new project that you’re working on with Ed or has this code worked in the past? I ask because as far as I can tell with the double complex version, the compiler has never been able to compile it. I checked as far back as the 15.x compilers and see the same issue.

Thanks,
Mat

Hi Mat,

I believe this is a new code that Ed just recently ported from C++.

We’re looking into implementing FFT in OpenACC using Kronecker products. (That’s the reason for the Matlab files in the repo, which are a proof of concept of the algorithm.) We’re doing this because the FFT grid dimensions are too small (3-D complex-to-complex, only ~50x50x50) to justify calling cuFFT, so we decided to implement a ZGEMM-based algorithm instead.
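For readers unfamiliar with the trick, here is a minimal sketch of the standard identity that lets a Kronecker product be applied with matrix multiplies: (A2 ⊗ A1) vec(X) = vec(A1 X A2^T), with vec taken in column-major order. The program name, sizes, and fill values are made up for illustration and are not taken from the repo; the repo's own routines may be organized differently.

```fortran
! Sketch only: checks (A2 kron A1) * vec(X) == vec(A1 * X * A2^T)
! on small complex matrices. All names and sizes here are hypothetical.
program kron_identity
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: p = 2, q = 3, r = 2, s = 2
  complex(dp) :: a1(p, q), a2(r, s), x(q, s), y(p, r)
  complex(dp) :: k(p*r, q*s), yk(p*r)
  integer :: i, j

  ! Fill the inputs with arbitrary deterministic values
  do j = 1, q
    do i = 1, p
      a1(i, j) = cmplx(i + j, i - j, kind=dp)
    end do
  end do
  do j = 1, s
    do i = 1, r
      a2(i, j) = cmplx(i, 2*j, kind=dp)
    end do
    do i = 1, q
      x(i, j) = cmplx(i*j, 1, kind=dp)
    end do
  end do

  ! Explicit Kronecker product: block (i,j) of K is a2(i,j) * A1
  do j = 1, s
    do i = 1, r
      k((i-1)*p+1:i*p, (j-1)*q+1:j*q) = a2(i, j)*a1
    end do
  end do

  ! Left side: one large matrix-vector product with the explicit K
  yk = matmul(k, reshape(x, [q*s]))

  ! Right side: two small matrix-matrix products (the GEMM-based form)
  y = matmul(matmul(a1, x), transpose(a2))

  if (any(abs(yk - reshape(y, [p*r])) > 1.0e-10_dp)) error stop "identity violated"
  print *, "ok"
end program kron_identity
```

The GEMM form never builds the large matrix K, which is why small transforms can be expressed as a handful of ZGEMM calls.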

Thanks,
Wil

Hi Wil (and Ed),

This issue came to us through another channel yesterday. One workaround I found for the first issue is to change the alpha and beta scalars to an “alphabeta” array of length 2. Mat has opened a bug for the issue. The original code should work, but it seems we are having trouble with complex*16 scalars that we believe to be constants.
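A minimal sketch of what that workaround could look like, assuming a simple update y = alpha*x + beta*y rather than the actual GEMM in kron_mod.f90. The names scale_mod, scale_add, and ab are hypothetical; only the idea of packing alpha and beta into a length-2 array comes from the post above.

```fortran
! Sketch only: alpha and beta travel as alphabeta(1:2) instead of as two
! complex*16 scalars. Routine and variable names are illustrative.
module scale_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! y = alphabeta(1)*x + alphabeta(2)*y
  subroutine scale_add(n, alphabeta, x, y)
    integer, intent(in) :: n
    complex(dp), intent(in) :: alphabeta(2)
    complex(dp), intent(in) :: x(n)
    complex(dp), intent(inout) :: y(n)
    integer :: i
    !$acc parallel loop copyin(alphabeta, x) copy(y)
    do i = 1, n
      y(i) = alphabeta(1)*x(i) + alphabeta(2)*y(i)
    end do
  end subroutine scale_add
end module scale_mod

program demo_alphabeta
  use scale_mod
  implicit none
  complex(dp) :: x(4), y(4), ab(2)
  x = cmplx(1.0_dp, 1.0_dp, kind=dp)
  y = cmplx(2.0_dp, 0.0_dp, kind=dp)
  ab(1) = cmplx(1.0_dp, 0.0_dp, kind=dp)   ! alpha = 1
  ab(2) = cmplx(0.0_dp, 0.0_dp, kind=dp)   ! beta  = 0
  call scale_add(4, ab, x, y)
  ! With alpha = 1 and beta = 0 the result must equal x
  if (any(abs(y - x) > 1.0e-12_dp)) error stop "unexpected result"
  print *, "ok"
end program demo_alphabeta
```

Without -acc the directive is treated as a comment, so the same source also builds and runs on the host.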

Still looking at the inlining issue, but Fortran inlining shouldn’t be necessary here if the routines are in the same file. CUDA will inline them.

- Brent

Hi Brent,

I can confirm that the workaround works, but I think we’ve run into another issue here with parallelization. Here’s the -Minfo output for kronmult2:

kronmult2:
    158, Generating copyin(a1(:,:)) [if not already present]
         Generating create(w(:,:)) [if not already present]
         Generating copyin(nn) [if not already present]
         Generating copyout(y(:,:)) [if not already present]
         Generating copyin(x(:,:),ld3,mm,alphabeta(:),kk,ld1,ld2,a2(:,:)) [if not already present]
    160, Generating present(x(:,:),w(:,:),a1(:,:))
    162, Loop carried dependence due to exposed use of alphabeta(:) prevents parallelization
         Inner sequential loop scheduled on host
         Accelerator serial kernel generated
         Generating Tesla code
    163, Reference argument passing prevents parallelization: ld2
         Reference argument passing prevents parallelization: ld1
         Reference argument passing prevents parallelization: kk
         Reference argument passing prevents parallelization: nn
         Reference argument passing prevents parallelization: mm
         Reference argument passing prevents parallelization: ld3

In the code I’m working on, which will reuse parts of this repo, the GEMM subroutine (called ZGEMM_acc there) resides in a different module, which is why we’re looking into inlining.

Thanks,
Wil

Dang, this shouldn’t be that hard. For the integer arguments passed by reference, I thought we would be okay since the arguments are marked as intent(in). I think that used to be sufficient, but I will need to go back and check whether that was actually the case. In any event, you can make that work (with better GPU performance as a bonus) if you pass those integer variables by value in device code. Then it will parallelize. Also, if you mark the loop in kronmult2 as “independent”, we will parallelize that. Or use ACC parallel instead of kernels.

The inlining problems are still there. They seem to be the result of using reshape with inlining, but you need reshape because of how the A matrix is passed. Let’s try to get the non-inlined cases working first :)
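As a rough illustration of the two suggestions (by-value scalars in the device routine, and an “independent” loop in the caller), here is a small sketch. The module and routine names (colscale_mod, scale_col, scale_all) are hypothetical and the arithmetic is deliberately trivial; it is not the kronmult2 code.

```fortran
! Sketch only: a device routine whose scalar dummies use the VALUE
! attribute, called from a kernels region with "loop independent".
module colscale_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Scales one column in place. n and s are passed by value, so the
  ! caller's copies cannot be modified through these dummies.
  subroutine scale_col(n, s, col)
    !$acc routine seq
    integer, value :: n
    real(dp), value :: s
    real(dp), intent(inout) :: col(n)
    integer :: i
    do i = 1, n
      col(i) = s*col(i)
    end do
  end subroutine scale_col

  subroutine scale_all(m, n, s, a)
    integer, intent(in) :: m, n
    real(dp), intent(in) :: s
    real(dp), intent(inout) :: a(m, n)
    integer :: j
    !$acc kernels copy(a)
    ! "independent" asserts that iterations over j do not depend on
    ! each other; each iteration touches a different column of a.
    !$acc loop independent
    do j = 1, n
      call scale_col(m, s, a(:, j))
    end do
    !$acc end kernels
  end subroutine scale_all
end module colscale_mod

program demo_independent
  use colscale_mod
  implicit none
  real(dp) :: a(4, 3)
  a = 1.0_dp
  call scale_all(4, 3, 2.0_dp, a)
  if (any(a /= 2.0_dp)) error stop "unexpected result"
  print *, "ok"
end program demo_independent
```

Since VALUE dummies require an explicit interface, keeping the device routine in a module (as here) is the simplest arrangement. It is worth re-checking the -Minfo output after a change like this to confirm what the compiler actually generated.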

Hi Brent,

Yes, I can confirm that after adding the independent clause, the compiler seems happy with parallelizing that loop.

On switching kernels to parallel: is this a case where the compiler’s implementation of parallel is “better” / more stable than its implementation of kernels, as in GCC?

Ed, I think I need your opinion here on kernels vs parallel.

Thanks,
Wil

One is not better than the other, but they have differences. As you’ve seen, kernels puts more freedom and responsibility on the compiler, and it may choose to even run a portion on the host if it cannot determine that it is safe to run it in parallel on the GPU. Parallel is more of an assertion to “Run this in parallel”, and the independent clause means pretty much the same thing: it is safe to run this in parallel. Good job actually checking the Minfo output. Sometimes things like this can go undetected.
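To make the distinction concrete, here is a small sketch of the same loop written both ways. The program name and the loop body are made up for illustration.

```fortran
! Sketch only: one loop under "kernels" with an independent clause, and
! the same loop under "parallel loop". Names and sizes are illustrative.
program kernels_vs_parallel
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 8
  real(dp) :: x(n), y1(n), y2(n)
  integer :: i

  x = 1.0_dp
  y1 = 0.0_dp
  y2 = 0.0_dp

  ! kernels: the compiler analyzes the region and decides how to run it.
  ! The independent clause tells it the iterations carry no dependence.
  !$acc kernels copyin(x) copy(y1)
  !$acc loop independent
  do i = 1, n
    y1(i) = 2.0_dp*x(i)
  end do
  !$acc end kernels

  ! parallel loop: the programmer asserts the loop is safe to run in
  ! parallel, and the compiler parallelizes it on that basis.
  !$acc parallel loop copyin(x) copy(y2)
  do i = 1, n
    y2(i) = 2.0_dp*x(i)
  end do

  if (any(y1 /= y2)) error stop "mismatch"
  print *, "ok"
end program kernels_vs_parallel
```

In both forms the correctness of the assertion is the programmer's responsibility: if the iterations really did depend on each other, either version could produce wrong results on the GPU.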

Would you kindly also try the OpenMP target offload directives (something like “-DOMP_TARGET -U_OPENACC -mp=gpu”) with the NVIDIA compiler? I think the OpenMP target offload directives work with the IBM XL compiler. Thank you very much.