Roadmap for CUDA Fortran?

I know this is a vague question, but I’m happy with whatever details can be shared. I have found CUDA Fortran to be my favorite way of using CUDA, and there’s no AMD equivalent, which makes it unique in the space of HPC on GPUs.

Is there any roadmap at all, or is it just bug fixes and CUDA library support for the foreseeable future? [if so, no need to answer anything below]

  1. Is there a roadmap for Fortran2008+ Standard compatibility?
    — Specifically, I am thinking of coarrays+CUDA. This could further simplify multiprocessor computing with CUDA, and I think would boost nvfortran popularity amongst the Modern Fortran community.
    — I have seen past posts here where developers commented that they were waiting on Flang or other LLVM Fortran compilers to become more complete before tackling further Standard compatibility. Is that still the status?
  2. I notice that nvfortran sometimes lags behind nvcc in supporting new CUDA features. Are there CUDA features that aren’t planned to be implemented in CUDA Fortran at all?
  3. Anything else relevant to the question?

Hi adench2,

Sorry for the late reply. I accidentally missed your post.

Basically, we add CUDA features to Fortran as they are added to CUDA C/C++, provided they aren’t C++-specific. Yes, there is a bit of a lag, but eventually many items are added, depending on their relevance to Fortran and usefulness to HPC-based applications. Though I don’t have any specific items that are being worked on right now.

Is there a particular new feature in CUDA that you’re looking for?

The use of coarrays with CUDA is an interesting topic, but it’s still too early to know how they will fit together. First we need to add coarray support in the host compiler, which is being done in conjunction with our Flang community effort. Once that’s complete, we’ll revisit what coarrays mean for CUDA Fortran.



Hi Mat,

Thanks a lot for the reply! It was a vague question, I didn’t expect a fast answer either way.

I was interested in the CUDA 11.0 cuda::memcpy_async function for asynchronous global->shared transfers and its associated synchronization functions, and I don’t see its usage mentioned anywhere in the CUDA Fortran Programming Guide online here. It appears to be part of libcu++, so I’m not sure what that means for this functionality in CUDA Fortran.
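For reference, this is roughly what the pattern looks like on the C++ side via the cooperative groups wrapper over the same machinery (a minimal sketch; the kernel itself is just an illustrative example, not from any real application):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Illustrative kernel: stage one tile of global memory into shared memory
// asynchronously, wait for it to land, then use it.
// Assumes n is a multiple of blockDim.x.
__global__ void scale(const float* in, float* out, float s, int n) {
    extern __shared__ float tile[];
    cg::thread_block block = cg::this_thread_block();

    // All threads in the block cooperatively issue the async copy.
    cg::memcpy_async(block, tile, in + blockIdx.x * blockDim.x,
                     sizeof(float) * blockDim.x);

    // Block until the copy issued above has completed.
    cg::wait(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * tile[threadIdx.x];
}
```

On hardware with the async-copy path (compute capability 8.0+), this avoids staging the data through registers; it's this kind of idiom I was hoping to reach from CUDA Fortran.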

I’ll eagerly anticipate the future of this then.


“cuda::memcpy_async” is a C++ template, so it couldn’t be used directly, but I’ll pass this request on to see if there’s a way to mimic the behavior in Fortran.

Thanks! It seems like it would be quite useful for large-scale HPC applications.

In the nvfortran wmma module, we have some support for the CUDA 11 pipelineMemcpyAsync(), pipelineCommit(), and pipelineWaitPrior(). But, unfortunately, it kind of languished without a good use case. I’m not sure we even documented it. With upcoming CUDA 12 features, we’ll revisit that, and I think you can expect at least support for thread block clusters (i.e., arrays of thread blocks) and the means to program them, using async transfers, from CUDA Fortran. It might require a revamp of our cooperative groups module.
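For anyone following along: those Fortran entry points appear to mirror the primitive pipeline intrinsics on the C/C++ side from `<cuda_pipeline.h>`. A rough sketch of that C-side idiom (the correspondence to the undocumented Fortran module's signatures is an assumption on my part):

```cuda
#include <cuda_pipeline.h>

// Illustrative kernel: a single-stage pipeline that overlaps a
// global->shared copy with independent work.
// Assumes n is a multiple of blockDim.x.
__global__ void scale(const float* in, float* out, float s) {
    extern __shared__ float tile[];

    // Issue an async 4-byte copy, then commit it as one batch
    // (pipelineMemcpyAsync / pipelineCommit on the Fortran side).
    __pipeline_memcpy_async(&tile[threadIdx.x],
                            &in[blockIdx.x * blockDim.x + threadIdx.x],
                            sizeof(float));
    __pipeline_commit();

    // ... independent work could overlap with the in-flight copy here ...

    // Wait until all but 0 committed batches have completed
    // (pipelineWaitPrior on the Fortran side).
    __pipeline_wait_prior(0);
    __syncthreads();

    out[blockIdx.x * blockDim.x + threadIdx.x] = s * tile[threadIdx.x];
}
```

If the Fortran wrappers do map one-to-one onto these intrinsics, a good worked example in the docs would go a long way.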
