Fortran MINVAL/MAXVAL with stdpar


I was wondering if I could get some clarification about something.

I have the following in my Fortran code:


where mask is a logical array.
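As a hedged sketch, the statement in question is presumably a masked reduction of this shape (the result variable `ratio_min` is hypothetical; `field_ratio` and `mask` match the compiler listing further down):

```fortran
! Hypothetical shape of the statement under discussion:
! a masked MINVAL reduction over a 3D array.
ratio_min = minval(field_ratio, mask=mask)
```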

When using OpenACC, in order to parallelize this on the GPU I use:

!$acc kernels default(present)
!$acc end kernels

And this works, with the compiler (NVHPC 22.7) reporting:

  41402, Generating default present(mask(:,:,:),field_ratio(:,:,:))
  41403, Loop is parallelizable
         Generating NVIDIA GPU code
      41403,   ! blockidx%x threadidx%x auto-collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
             Generating implicit reduction(min:field_ratio$r)

I am in the process of converting the code to use standard parallelism as much as possible, so I have first been removing all kernels that I can.

My question is: if I compile with -stdpar=gpu, will the MINVAL automatically be offloaded to the GPU?

If yes, and I use -nomanaged (manual data movement with OpenACC), would I need to wrap the call in something like:

!$acc host_data use_device(field_ratio,mask)
!$acc end host_data
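That is, a hedged sketch of the wrapped call (again with a hypothetical result variable `ratio_min`):

```fortran
!$acc host_data use_device(field_ratio, mask)
! Inside host_data, field_ratio and mask resolve to their device copies,
! so a GPU-compiled MINVAL would operate on device data.
ratio_min = minval(field_ratio, mask=mask)
!$acc end host_data
```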


A last question: what happens if I have a MINVAL call that I want to run on the CPU (during initialization), on data that may or may not have a copy on the GPU? With managed memory, would the compiler transfer the data to and from the GPU? And with -nomanaged, would it run on the CPU if there is no GPU copy of the arrays, or would it try to run on the GPU?

It would seem to me that with managed memory it should compute on the GPU, even if that means slow transfers, while with -nomanaged it should only run on the GPU if the call is contained in a host_data region. Or am I mixing apples and oranges (stdpar vs. OpenACC)?

– Ron

It will not run on the GPU without acc kernels around it: we do not automatically offload any F90-style array intrinsics. We are considering changing that in the future, though.

Right now, acc kernels works because the code for the MINVAL gets inlined early in our compiler, and then the normal parallelization flow just works.

Also, in CUDA Fortran we have device versions (actual overloaded function calls) of many intrinsics, including MINVAL. That requires the compiler to recognize that the arrays have either the CUF managed or device attribute, and it requires a "use cudafor" in the program unit.
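As a hedged sketch of that CUDA Fortran path (illustrative names; assumes compiling with nvfortran -cuf on a system with an NVIDIA GPU):

```fortran
program device_minval_sketch
  use cudafor          ! brings the overloaded device intrinsics into scope
  implicit none
  real, device, allocatable :: a_d(:,:,:)   ! array with the device attribute
  real :: a(8,8,8), r

  call random_number(a)
  allocate(a_d(8,8,8))
  a_d = a              ! host-to-device copy via assignment

  ! Because a_d has the device attribute and cudafor is in scope,
  ! this resolves to the overloaded GPU MINVAL rather than the host intrinsic.
  r = minval(a_d)
  print *, 'device minval =', r
end program device_minval_sketch
```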

I’ve been working on many more functions, similar to what we did for MATMUL, TRANSPOSE, and RESHAPE, described in a blog post I wrote a couple of years ago. Probably the next step is to always enable those and “do the right thing” based on the hardware and whether the data can be accessed from the GPU or not. We still need a way for the programmer to override the compiler’s default decision, which addresses your last two paragraphs.