Nvfortran compilation error for stdpar

I am trying to compile a code with nvfortran 24.11 on a GH200 using unified memory and “do concurrent.” I have attached a build directory with a Makefile. The issue is in the following region of “chmrhs_gpu3h.f”:
do concurrent(mdp=ml_stream(n):mu_stream(n))
call chmrhs_gpu(mdp, icnt_max)
call thermal_perfect_gpu(mdp)
enddo

I want the two routines shown to be run on the gpu. In particular, the routine chmrhs_gpu() itself calls “device” routines. The problem currently encountered is that I get an error that I cannot understand …

. . .
NVFORTRAN-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): No device symbol for address reference (chmrhs_dcpl_gpu3h.f: 124)
chm_simp3:
124, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
NVFORTRAN-F-0704-Compilation aborted due to previous errors. (chmrhs_dcpl_gpu3h.f)
NVFORTRAN/arm64 Linux 24.11-0: compilation aborted
make: *** [Makefile:24: chmrhs_dcpl_gpu3h.o] Error 2

It appears that the compiler generated GPU code but cannot find a device symbol. The routines in question are contained within a module.

What is needed to correct this error? Thank you.

CHEM_STDPAR_TGZ.txt (82.2 KB)

Hi Don,

Thanks for the report and reproducing example. This is a compiler error where it loses track of a symbol. In this case I was able to track it down to the “nmax” variable when it is passed in to “spgval_lcl”.

I have added a problem report, TPR #36898, and sent it to engineering for review. I can work around it by adding the “value” attribute to “nmax”'s declaration:

!-------------------------------------------------------------------
      pure subroutine spgval_lcl(nmax,gf,t)
!$acc routine seq
!-------------------------------------------------------------------
      use common_a08, only: rgd
      use common_a09, only: ntable,gftb,cptb,sfgf,ttb,ttbi
      use common_a10, only: tinfd

      implicit none
      integer,value,intent(in):: nmax
      real(kind=4),intent(in)   :: t

After working around this one, I did hit a second, unrelated compiler ICE: "Intrinsic already declared with different signature". This is in the host code generation, so it reproduces with or without OpenACC enabled.

I traced this one down to using integer kind=1 variables as the exponent of pow. I reported this as TPR #36900.

The workaround is to change “kind=1” to “kind=4” for the “nurm” and “nupm” variables declared in “chm_simp3”.

integer(kind=4):: nurm,nupm
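
For anyone hitting the same ICE outside this code, the pattern can be sketched in isolation. This is only an illustrative sketch, not taken from the attached source: the program name and the variables `x`, `y`, `n1` and `n4` are hypothetical, and it has not been checked against 24.11. It assumes the exponent has to stay “kind=1” for storage reasons, in which case widening it to “kind=4” just before the power operation should have the same effect as changing the declaration.

```fortran
! Hypothetical stand-alone sketch; names are not from the attached code.
program pow_kind_demo
  implicit none
  integer(kind=1) :: n1   ! exponent kind that triggered the reported ICE
  integer(kind=4) :: n4   ! kind used by the workaround
  real(kind=4)    :: x, y

  x  = 2.0
  n1 = 3

  ! Widen the exponent before exponentiation instead of writing x**n1.
  n4 = int(n1, kind=4)
  y  = x**n4

  print *, y
end program pow_kind_demo
```

If the variable does not need to be one byte wide, changing the declaration as shown above is the simpler fix.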

Note that you have a small error in your makefile where there’s “-Minfo messages”. This should be just “-Minfo”. As written, “messages” gets passed to the linker as a file, causing the link to fail.

Once I added the workarounds and fixed the makefile, the code built for me with 24.11. I tried running it, but there’s a missing input file.

CHEM_STDPAR% ./test.exe

     Runtime Working Directory: /local/home/mcolgrove/CHEM_STDPAR

     ERROR: cannot find main input file ./craft.inp

   *** Program Stopping ***

Thanks again,
Mat

Thank you so much. Regarding routines/functions called within a “do concurrent” region (or even OpenACC) :
Q1: Do scalar arguments need to be explicitly given the “value” attribute by the coder (or should it be automatic)
Q2: Do “device” routines called by “global” routines need to be labelled with “!$acc routine seq”? I understand that
recent versions of nvfortran no longer require the “global” routines (here, ones called within the do concurrent region) to be
so labelled. Kind of a routine dependency question and how it may be resolved by compiler.
Q3: For device routines, do I/will I need to specify “value” attributes for scalar arguments for STDPAR and OpenACC (in CUDA I do not).

Respectfully,
Don Kenzakowski

Q4: Is there a syntax with OpenACC or unified memory regarding “constant” memory access attributes for variables? Constant
memory, from my understanding, had faster accessibility than global memory, at least for discrete cards, but was written only by
the cpu and read by threads on gpu. Will this access procedure eventually become unneeded with unified memory, performance-wise?

Respectfully,
Don Kenzakowski

Q1: Do scalar arguments need to be explicitly given the “value” attribute by the coder (or should it be automatic)

They don’t need to be, but it is better practice for read-only variables. By default, Fortran passes arguments by reference, so the address could be taken by a global reference. That would be uncommon, but it is possible. It can then prevent parallelism, and the compiler may not be able to implicitly privatize the scalar. Granted, you’re using “pure”, so the compiler can assume no side effects, which lessens the need, but I still consider it best practice.
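
As a minimal sketch of that practice (all names here, `scale_mod`, `scale_point`, `nmax`, `a`, are hypothetical and not from the attached code), a read-only scalar can be declared with “value” in a pure module routine that is called from a “do concurrent” loop. The explicit “!$acc routine seq” mirrors the style of “spgval_lcl” above; whether it is strictly required depends on where the routine is defined.

```fortran
! Hypothetical example; module, routine and variable names are illustrative only.
module scale_mod
  implicit none
contains
  pure subroutine scale_point(nmax, x)
    !$acc routine seq
    ! Read-only scalar passed by value, so no address is shared with the caller.
    integer, value, intent(in)   :: nmax
    real(kind=4), intent(inout)  :: x
    x = x * real(nmax, kind=4)
  end subroutine scale_point
end module scale_mod

program value_demo
  use scale_mod, only: scale_point
  implicit none
  integer, parameter :: n = 8
  real(kind=4) :: a(n)
  integer :: i

  a = 1.0
  ! Each iteration updates only its own element, so the loop is safe to parallelize.
  do concurrent (i = 1:n)
    call scale_point(3, a(i))
  end do

  print *, a
end program value_demo
```

Because the routine lives in a module, the caller sees an explicit interface, which Fortran requires for dummy arguments that have the “value” attribute.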

Q2: Do “device” routines called by “global” routines need to be labelled with “!$acc routine seq”?

If the routine’s definition is in the same scoping unit, then the compiler can often implicitly generate the device routine. The directive is only required if the definition is in a separate file.

Q3: For device routines, do I/will I need to specify “value” attributes for scalar arguments for STDPAR and OpenACC (in CUDA I do not).

Same answer as Q1. It’s not required, but better practice, including in CUDA Fortran. For CUDA it’s also a good idea to use “value” when passing scalars to the global kernel. Then the arguments are stored local to the kernel, as opposed to the kernel needing to fetch the value from global memory.

Q4: Is there a syntax with OpenACC or unified memory regarding “constant” memory access attributes for variables?

I believe the OpenACC technical committee has discussed this in the past, but I’m not sure where they’re at on it. If I remember correctly, the resistance to it is that “constant” is more NVIDIA specific and they were instead looking at more general memory placement operations.

Our compiler will attempt to implicitly use constant memory for parameters.

Will this access procedure eventually become unneeded with unified memory, performance-wise?

My personal view is that constant isn’t really needed now, irrespective of UM. It used to be a physically separate memory but now is integrated. Hardware caching has gotten very good, so I don’t see the need for constant. Maybe for a large read-only array, but scalars, not so much. Granted, I have not done a formal study, so this is based on my own perception (i.e. I could be wrong).

Memory placement is important with UM, primarily whether the GPU should fetch directly from host memory or copy some data to device memory. I think of UM now as more akin to NUMA, so you want the data stored in the memory closest to where it’s being computed. Placement is done by the CUDA runtime, with our compiler implicitly adding hints via calls to cudaMemAdvise. Users can call cudaMemAdvise directly, but we’d rather make it so you don’t have to.

While there will be exceptions given the Fortran standard has limitations, I see the ultimate goal as being able to have a pure STDPAR code with no extensions and have it run just as fast as if you added them. Thus achieving both portability and performance.

I have modules that (originally written for cuda separate memory) store values in constant memory that
replicate those originally stored on cpu. For example,

module craft_flags
integer:: nsm,ns,neqn
integer:: neq,nctb,ielec
end module craft_flags

module gpu_craft_flags
integer,constant:: nsm,ns,neqn
integer,constant:: neq,nctb,ielec
end module gpu_craft_flags

For the stdpar case example, it seems I NEED to use the module without the constant attribute assignments.
If I replace “use craft_flags” with “use gpu_craft_flags” at line 49 of chmrhs_gpu3h.f, I get the following compiler error …

0 inform, 1 warnings, 0 severes, 0 fatal for chmrhs_gpu
NVFORTRAN-S-1058-Call to Compiler runtime function not supported - pgf90_dev_common_addr (chmrhs_dcpl_gpu3h.f: 104)
0 inform, 0 warnings, 1 severes, 0 fatal for chmrhs_gpu

Would this be a future compiler fix, or is this a legitimate coding error (i.e. I used constant memory in a routine that would be running on
the gpu but called from a stdpar region, and so a potential compilation conflict)?

Line 104 error is associated with the first argument to the “decomp_pvt” routine you mentioned yesterday.

Trying to best understand the “guardrail” practices to using directives/stdpar successfully without creating unnecessary “taboos” regarding
coding practices for more complex codes.

FYI, TPR #36900, a compiler error when using integer kind=1 as the exponent of a pow operation, has been fixed in our 25.1 release.