NVFORTRAN-S-0000-Internal compiler error. Call in OpenACC region to support routine - pgf90_dev_common_addr (cuf_nspt_trpcm_mod.CUF: 38)

Dear all,

The following cuf kerel do region shows the Internal compiler error.
== code ===

17 type(dd_complex), device, allocatable :: u(:,:,:)
18 type(dd_complex), device, allocatable :: XX(:,:,:)
19 type(dd_complex), device, allocatable :: XP(:,:,:)

38 !$cuf kernel do(3) <<<,>>>
39 do k=1,NORDER
40 do jc=1,COL
41 do ic=1,COL
42 ip = mod(ic-1+1 ,COL)+1
43 jp = mod(jc-1+1 ,COL)+1
44 im = mod(ic-1-1+COL,COL)+1
45 jm = mod(jc-1-1+COL,COL)+1
46 call assign_add( XP(ic,jc, k), u(jp,ip,k), u(jm,im,k) )
47 call assign_mul( ztmp, u(jc,ic,k), cuf_m_conjg_phase_gamma2_gamma2(ic,jc) )
48 call assign_add( XP(ic,jc, k), XP(jp,ip,k), ztmp)
49 XP(ic,jc, k)%ddc(3:4) = - XP(jp,ip,k)%ddc(3:4)
50 enddo
51 enddo
52 enddo

The error meesages are


NVFORTRAN-S-0000-Internal compiler error. Call in OpenACC region to support routine - pgf90_dev_common_addr (cuf_nspt_trpcm_mod.CUF: 38)
NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages) (cuf_nspt_trpcm_mod.CUF: 38)
cuf_get_force_trpcm:
307, include ‘cuf_get_force_trpcm.h90’
41, Generating implicit private(ip,jp,jm,im)
CUDA kernel generated
41, !$cuf kernel do <<< (,,*), (64,2,4) >>>
47, Accelerator restriction: unsupported call to support routine ‘pgf90_dev_common_addr’

What does this message mean?
index calculations using mod(,) could be the source of the error?

Best,
Ken-Ichi

Hi Ken-Ichi,

The error means that one of the compiler’s host runtime routines, which is not supported on the device, is getting added to the device code. Why this is the case, I’m not sure.

The routine “common_addr” is used to look up the device address of a common block or module which would then get passed into the kernel.

My best guess is that there’s something in your “assign” routines that’s directly accessing a module variable or common block.

Are you able to provide a reproducing example so I can investigate?

Thanks,
Mat

Dear Mat,

Thank you for your explanation.
My current workaround was simply extracting the do-loop kernel as the explicit global-kernel and the kernel works well.

I will try to create the reproducer, but the entire code set is quite large and in developping, so it could be challenging to make a simplified reproducer.

Best,
Ken-Ichi

Dear Mat,

I have extracted the part to reproduce the corresponding error.
How can I share the code?

Best,
Ken-Ichi

Hi Ken-Ichi,

If it’s ok to share publicly, you can attach it to a post (look for the icon of the computer with an up arrow).

If it’s not public, you can direct message me by opening the link on my user name and select the “message” button. You can either attach the example there, or I can email you so we can coordinate.

-Mat

Dear Mat,

Thank you. I found the ion in this editing reply message box.
This contains Makefile, and the source code is a reduced version but still contains irrelevant subroutines related to this issue.

Best,
Ken-Ichi

NVF_ERROR_REPRODUCER.tar.gz (6.3 KB)

Thanks Ken-Ichi, I was able to reproduce the error here and tracked it down to passing an element of “cuf_m_phase_gamma2_gamma2(ic,jc)” to “assign_mul” at line 79 of “cuf_update_p_trpcm.h90”.

The problem being that this array is a static module device array of a derived type. In order to pass the element given it’s a static array, the compiler needs to get the base address of the module along with the offset. However since the module is on the host, its using a host-only runtime routine.

I’ve added a problem report, TPR #37476, and sent it to engineering for investigation.

One work around is to make “cuf_m_phase_gamma2_gamma2” an allocatable array. You’ll need to then allocate at runtime, but then the compiler can just pass in the device address rather having to find the offset from the base module address.

Thanks for the report!
Mat

Dear Mat,

Thank you for the detailed explanation on the issue. I understand the situation.
I have a question on this issue:
The global-kernel version seems to be working with the statically allocated module device array in the full code set. Is this behavior expected behavior, or does it conform to CUDA Fortran behavior?

Best,
Ken-Ichi

It’s likely situational and the compiler isn’t doing the correct thing in this case. Hence why I want our engineers to look at. I’ll let you know if they come back saying it’s a limitation, but right now I consider it a compiler bug.