Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK

Originally published at:

Fortran developers have long been able to accelerate their programs using CUDA Fortran or OpenACC. Now with the latest 20.11 release of the NVIDIA HPC SDK, the included NVFORTRAN compiler automatically accelerates DO CONCURRENT, allowing you to get the benefit of the full power of NVIDIA GPUs using ISO Standard Fortran without any extensions, directives,…
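For readers new to stdpar, here is a minimal sketch (mine, not from the blog post) of what the feature looks like: a SAXPY-style loop written in pure ISO Standard Fortran, which NVFORTRAN can offload to the GPU when compiled with `nvfortran -stdpar=gpu saxpy.f90`.

```fortran
! Minimal stdpar sketch: no directives, no CUDA extensions.
program saxpy_stdpar
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  x = 1.0
  y = 2.0

  ! The iterations are declared independent, so the compiler is free to
  ! run them in parallel on the GPU (or across CPU cores with -stdpar=multicore).
  do concurrent (i = 1:n)
    y(i) = y(i) + 2.0 * x(i)
  end do

  print *, 'y(1) =', y(1)
end program saxpy_stdpar
```

Note the arrays are ALLOCATABLE, which matters for GPU offload, as discussed further down this thread.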

I heard about this at SC18. Glad that it has finally become reality! Super useful. Looking forward to the full implementation without the limitations mentioned in the blog post.

When is it scheduled for release?

We appreciate your interest! We are continuing to work on removing limitations, so be sure to keep an eye out for upcoming releases with improvements.


Hi. The DO CONCURRENT implementation will be available in our next release of the NVIDIA HPC SDK, version 20.11.


This was a great article! It appears that all the discussion and examples are based on accelerating Fortran without any need for CUDA programming, but only on a single GPU.

From my work so far on multi-GPU programming, invoking two GPUs and partitioning the data between them always needs some CUDA-related code – for instance, binding an MPI rank or a thread to one of the GPUs, using CUDA streams for simultaneous use of multiple GPUs, or other approaches to enable acceleration on multiple GPUs.

All of these at the very least require selecting the device one way or another, which in turn needs CUDA and hence runs in the opposite direction of “Accelerating Fortran with a GPU Using stdpar”, where the goal is not to change the CPU-based code (no CUDA runtime API, etc.) and simply to compile it with NVFORTRAN.

Perhaps it would be ideal if, in an MPI-based code – consisting of i) simultaneous use of CPUs to solve each sub-domain/array and ii) possible CPU-to-CPU communication – compiling with NVFORTRAN and stdpar automatically offloaded a DO loop inside MPI process #0 to GPU 0, the same loop inside MPI process #1 to GPU 1, and so on. That way, a code base that is already massively parallelized with MPI on CPUs could run on and utilize multi-GPU environments equally well with no change to any of the paradigms. If only this were possible…

**I’m very curious whether there is already any way around this, and if not, is this something to look forward to in the future?** I’d appreciate any insights.

Indeed, multi-GPU programming is an important use case, and we are already looking into ways to make that easier, including with stdpar. I’ll also mention that we are planning to publish another blog post about more advanced usage of DO CONCURRENT, so keep an eye out for that as well.
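For reference, a sketch of the rank-to-GPU binding described above, using CUDA Fortran’s `cudafor` module – so it is not pure ISO Fortran, which is exactly the compromise the question is about. Each MPI rank selects one device; its subsequent DO CONCURRENT loops compiled with `-stdpar=gpu -cuda` then run on that rank’s GPU. This is a hedged example of a common pattern, not an official recommendation.

```fortran
program bind_rank_to_gpu
  use mpi
  use cudafor
  implicit none
  integer :: rank, ndev, ierr, istat

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Round-robin the ranks over the available GPUs on this node.
  istat = cudaGetDeviceCount(ndev)
  istat = cudaSetDevice(mod(rank, ndev))

  ! ... each rank now offloads its own sub-domain's DO CONCURRENT loops ...

  call MPI_Finalize(ierr)
end program bind_rank_to_gpu
```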

Are Tensor Cores used in this way?

I’m running into an error when using the DO CONCURRENT construct that is making me scratch my head.

“NVFORTRAN-F-0000-Internal compiler error. Missing end DO CONCURRENT region block”

Anybody else running into this error? And for the record, yes I do have an END DO at the end of my very simple loop.

Hi. The DO CONCURRENT feature is accelerated using the GPU’s compute SMs. However, the Tensor Cores are engaged when using ISO Fortran array intrinsics, as described in another developer blog, Bringing Tensor Cores to Standard Fortran.

Hi, can you please show me your code and how you compile it?

Here is the test program:

program test
  implicit none
  integer :: i, j
  integer, parameter :: m=10000, n=10000
  real :: a(m,n), b(m,n), c(m,n)

  do concurrent(i=1:n, j=1:m)
  end do
end program test

Here is the compile command:
nvfortran -stdpar=gpu,multicore test.f90 -o test
Here’s the execute command:

This is probably a compiler bug. The following change works:

!real :: a(m,n),b(m,n),c(m,n)
real,allocatable,dimension(:,:) :: a,b,c

nvfortran -stdpar=gpu,multicore forum.f90 -o test -Minfo

         10, Generating Tesla code
             10, Loop parallelized across CUDA thread blocks, CUDA threads(128) blockidx%x threadidx%x
                 Loop parallelized across CUDA thread blocks ! blockidx%y
         10, Generating Multicore code
             10, Loop parallelized across CPU threads

Yep, works on my system too. Thanks!

BTW, any timeline on the next update for DO CONCURRENT?

I’ve got another bug for you. I increased the rank of the arrays by one. If the array size becomes too large, the code quits with the message “Killed”. The code will run with l=10.

program test
  implicit none
  integer :: i, j, k
  integer, parameter :: m=10000, n=10000, l=100
  !real :: a(m,n),b(m,n),c(m,n)
  real, allocatable, dimension(:,:,:) :: a, b, c
  real :: start, finish

  do concurrent(i=1:n, j=1:m, k=1:l)
  end do
end program test

nvfortran -stdpar=gpu,multicore -Minfo test.f90 -o test

Mea culpa. It looks like this is a memory limit that has nothing to do with NVFORTRAN or the DO CONCURRENT loop – just a Fortran-related memory limitation.

In terms of the next update: there are bug fixes in every release. However, your original code won’t work unless you have a truly unified memory system, or an IBM POWER system with ATS enabled. You declared the arrays on the host stack, where they cannot leverage CUDA managed memory. I hope that is clear.
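To illustrate the distinction in my own words (a sketch, not from the blog): under `-stdpar=gpu`, ALLOCATABLE arrays are placed in CUDA managed memory and can migrate to the GPU, while fixed-size local arrays live on the host stack and cannot.

```fortran
program managed_vs_stack
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: s(n)               ! host stack: NOT managed; problematic under -stdpar=gpu
  real, allocatable :: a(:)  ! heap: allocated as managed memory under -stdpar=gpu

  allocate(a(n))

  ! Safe to offload: a(:) is accessible from both CPU and GPU.
  do concurrent (i = 1:n)
    a(i) = real(i)
  end do

  print *, 'sum =', sum(a)
end program managed_vs_stack
```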