They don’t need to be, but it’s better practice for read-only scalar arguments. By default, Fortran passes arguments by reference, so the address could be captured by a global reference. That would be uncommon, but it’s possible, and it can then prevent parallelization since the compiler may not be able to implicitly privatize the scalar. Granted, you’re using “pure”, so the compiler can assume no side effects, which lessens the need, but I still consider it best practice.
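As a minimal sketch (the names here are hypothetical), a read-only scalar argument to a pure routine can be given the `value` attribute so the compiler never has to assume its address escapes:

```fortran
! Hypothetical example: "factor" is read-only, so passing it by value
! lets the compiler privatize it freely when parallelizing callers.
pure function scale_val(x, factor) result(y)
   real, intent(in) :: x
   real, value      :: factor   ! by value; no aliasing through its address
   real :: y
   y = x * factor
end function scale_val
```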
Q2: Do “device” routines called by “global” routines need to be labelled with “!$acc routine seq”?
If the routine’s definition is in the same scoping unit, the compiler can often implicitly generate the device routine. The directive is only required when the definition is in a separate file.
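A hedged sketch of the case where the directive is needed, assuming the callee lives in its own module file (all names here are made up for illustration):

```fortran
! utils_m.f90 -- compiled separately from the caller, so the device
! version must be requested explicitly with "!$acc routine seq".
module utils_m
contains
   subroutine add_one(x)
      !$acc routine seq
      real, intent(inout) :: x
      x = x + 1.0
   end subroutine add_one
end module utils_m
```

If `add_one` were defined in the same scoping unit as the OpenACC compute region that calls it, the compiler could typically generate the device routine implicitly.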
Q3: For device routines, do I/will I need to specify “value” attributes for scalar arguments for STDPAR and OpenACC (in CUDA I do not).
Same answer as Q1: it’s not required, but it’s better practice, including in CUDA Fortran. For CUDA it’s also a good idea to use “value” when passing scalars to a global kernel. The arguments are then stored local to the kernel, as opposed to the kernel needing to fetch the value from global memory.
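For instance, a hypothetical CUDA Fortran SAXPY kernel where the scalars carry the `value` attribute, so each thread reads them from its kernel arguments rather than fetching them from global memory:

```fortran
! Sketch of a CUDA Fortran global kernel; "n" and "alpha" are passed
! by value so no global-memory fetch is needed to read them.
attributes(global) subroutine saxpy(n, alpha, x, y)
   integer, value :: n
   real,    value :: alpha
   real :: x(n), y(n)
   integer :: i
   i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
   if (i <= n) y(i) = alpha * x(i) + y(i)
end subroutine saxpy
```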
Q4: Is there a syntax with OpenACC or unified memory regarding “constant” memory access attributes for variables?
I believe the OpenACC technical committee has discussed this in the past, but I’m not sure where they stand on it. If I remember correctly, the resistance to it is that “constant” is more NVIDIA-specific, and they were instead looking at more general memory placement operations.
Our compiler will attempt to implicitly use constant memory for parameters.
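For example, a named Fortran `parameter` array used inside a device loop is a candidate for this implicit placement (the coefficients here are just an illustration):

```fortran
! Sketch: "coef" is a compile-time parameter, so the compiler may
! implicitly place it in GPU constant memory when it's referenced
! from an OpenACC compute region.
subroutine apply_stencil(n, a, b)
   integer, intent(in)  :: n
   real,    intent(in)  :: a(n)
   real,    intent(out) :: b(n)
   real, parameter :: coef(3) = [0.25, 0.5, 0.25]
   integer :: i
   !$acc parallel loop present(a, b)
   do i = 2, n - 1
      b(i) = coef(1)*a(i-1) + coef(2)*a(i) + coef(3)*a(i+1)
   end do
end subroutine apply_stencil
```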
Will this access procedure eventually become unneeded with unified memory, performance-wise?
My personal view is that constant isn’t really needed now, irrespective of UM. It used to be a physically separate memory but is now integrated. Hardware caching has gotten very good, so I don’t see the need for constant memory. Maybe for a large read-only array, but for scalars, not so much. Granted, I have not done a formal study, so this is based on my own perception (i.e. I could be wrong).
Memory placement is important with UM, primarily whether the GPU should fetch directly from host memory or copy some data to the device memory. I think of UM now as more akin to NUMA, so you want the data stored in the memory closest to where it’s being computed. Placement is done by the CUDA runtime, with our compiler implicitly adding hints via calls to cudaMemAdvise. Users can call cudaMemAdvise directly, but we’d rather make it so you don’t have to.
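For completeness, a hedged sketch of what a manual hint looks like, assuming the CUDA Fortran (`cudafor`) interfaces to cudaMemAdvise; again, the compiler normally inserts hints like this for you under unified memory:

```fortran
! Sketch (assumes the cudafor module's cudaMemAdvise interface):
! mark a managed array read-mostly so the runtime may keep copies
! close to whichever processors read it.
program advise_demo
   use cudafor
   implicit none
   real, managed, allocatable :: a(:)
   integer :: istat, dev
   allocate(a(1000000))
   a = 1.0
   istat = cudaGetDevice(dev)
   istat = cudaMemAdvise(a, size(a), cudaMemAdviseSetReadMostly, dev)
end program advise_demo
```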
While there will be exceptions given that the Fortran standard has limitations, I see the ultimate goal as being able to have a pure STDPAR code with no extensions and have it run just as fast as if you had added them, thus achieving both portability and performance.