Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK

Fortran developers have long been able to accelerate their programs using CUDA Fortran or OpenACC. Now with the latest 20.11 release of the NVIDIA HPC SDK, the included NVFORTRAN compiler automatically accelerates DO CONCURRENT, allowing you to get the benefit of the full power of NVIDIA GPUs using ISO Standard Fortran without any extensions, directives,…
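For readers who want a feel for what this looks like, here is a minimal sketch (not taken from the blog post) of the kind of loop NVFORTRAN can offload; the allocatable arrays let the compiler place the data in CUDA managed memory:

```fortran
! Minimal illustrative sketch: a SAXPY-style DO CONCURRENT loop
! that NVFORTRAN can offload to the GPU with -stdpar=gpu.
program saxpy_dc
  implicit none
  integer :: i
  integer, parameter :: n = 1000000
  real, allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  x = 1.0
  y = 2.0

  do concurrent (i = 1:n)
     y(i) = 2.0*x(i) + y(i)    ! each iteration is independent
  end do

  print *, y(1)                ! expect 4.0
end program saxpy_dc
```

Compiled with, e.g., `nvfortran -stdpar=gpu saxpy_dc.f90 -o saxpy_dc`, no directives or extensions are needed.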

I heard about this at SC18. Glad that it has finally become a reality! Super useful. Looking forward to a full implementation without the limitations mentioned in the blog post.

When is it scheduled for release?

We appreciate your interest! We are continuing to work on removing limitations, so be sure to keep an eye out for upcoming releases with improvements.

Hi. The DO CONCURRENT implementation will be available in our next release of the NVIDIA HPC SDK, version 20.11.

This was a great article! It appears that all the discussion and examples are based on accelerating Fortran without any need for CUDA programming, but only on a single GPU.

From my work so far on multi-GPU programming, invoking two GPUs and partitioning the data between them has always required some CUDA-related code: for instance, binding an MPI rank or a thread to one of the GPUs, using CUDA streams to drive multiple GPUs simultaneously, or other approaches along those lines.

All of these at the very least require selecting the device one way or another, which needs CUDA, and hence runs in the opposite direction of “Accelerating Fortran with a GPU Using stdpar”, where the goal is to leave the CPU-based code unchanged (no CUDA runtime API calls, etc.) and simply compile it with NVFORTRAN.

Ideally, in an MPI-based code consisting of (i) simultaneous use of CPUs to solve each sub-domain/array and (ii) CPU-to-CPU communication, compiling with NVFORTRAN and stdpar would automatically offload a DO loop inside MPI process #0 to GPU 0, the same loop inside MPI process #1 to GPU 1, and so on. That way, a code base that is already massively parallelized with MPI on CPUs could run on and utilize multi-GPU environments equally well with NO change to any of its paradigms. If only this were possible…

**I’m very curious whether there is already any way around this, and if not, whether this is something to look forward to in the future.** I’d appreciate any insights.

Indeed, multi-GPU programming is an important use case, and we are already looking into ways to make it easier, including with stdpar. I’ll also mention that we are planning to publish another blog post about more advanced usage of DO CONCURRENT, so keep an eye out for that as well.
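In the meantime, a workaround many MPI codes use, which keeps the Fortran source itself free of CUDA calls, is to restrict each rank’s visible devices at launch time via `CUDA_VISIBLE_DEVICES`. The sketch below is illustrative (the wrapper name and app name are invented) and assumes Open MPI, which exports `OMPI_COMM_WORLD_LOCAL_RANK` into each rank’s environment:

```shell
# Hypothetical sketch: bind each local MPI rank to its own GPU at launch
# time, so the stdpar Fortran source needs no device-selection code.
# Assumes Open MPI, which sets OMPI_COMM_WORLD_LOCAL_RANK per rank.
NGPUS=${NGPUS:-$(nvidia-smi -L 2>/dev/null | wc -l)}   # GPUs on this node
NGPUS=$((NGPUS > 0 ? NGPUS : 1))                       # fall back to 1 if none detected
export CUDA_VISIBLE_DEVICES=$((${OMPI_COMM_WORLD_LOCAL_RANK:-0} % NGPUS))
# From a wrapper script, one would then run the unmodified binary, e.g.:
#   mpirun -np 4 ./bind_gpu.sh ./my_stdpar_app
```

Each rank then sees exactly one device (device 0 from its own point of view), so the stdpar runtime has nothing to choose.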

Are the Tensor Cores used in this way?

I’m running into an error when using the DO CONCURRENT construct that is making me scratch my head.

“NVFORTRAN-F-0000-Internal compiler error. Missing end DO CONCURRENT region block”

Anybody else running into this error? And for the record, yes I do have an END DO at the end of my very simple loop.

Hi. The DO CONCURRENT feature is accelerated using the GPU’s compute SMs. The Tensor Cores, however, are engaged when using ISO Fortran array intrinsics, as described in another developer blog, Bringing Tensor Cores to Standard Fortran.
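As a sketch of the pattern that blog describes (this example is not from the thread, and the exact flags are my assumption; see the blog for details), an intrinsic such as MATMUL is a candidate for mapping to cuTENSOR, which can in turn use the Tensor Cores:

```fortran
! Illustrative sketch: an ISO Fortran array intrinsic that NVFORTRAN
! can map to cuTENSOR, which may engage Tensor Cores on the GPU.
program matmul_tc
  implicit none
  integer, parameter :: n = 2048
  real(4), allocatable :: a(:,:), b(:,:), c(:,:)

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  c = matmul(a, b)     ! whole-array intrinsic: candidate for cuTENSOR offload

  print *, c(1,1)
end program matmul_tc
```

Compiling with something like `nvfortran -stdpar=gpu -Minfo matmul_tc.f90` should report how the intrinsic was mapped.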

Hi, can you please show me your code and how you compile it?

Here is the test program:

program test

  implicit none

  integer :: i, j
  integer, parameter :: m=10000, n=10000
  real :: a(m,n), b(m,n), c(m,n)

  do concurrent (i=1:n, j=1:m)
  end do

end program test

Here is the compile command:
nvfortran -stdpar=gpu,multicore test.f90 -o test
Here’s the execute command:

This is probably a bug. The following change works:

!real :: a(m,n),b(m,n),c(m,n)
real,allocatable,dimension(:,:) :: a,b,c

nvfortran -stdpar=gpu,multicore forum.f90 -o test -Minfo

         10, Generating Tesla code
             10, Loop parallelized across CUDA thread blocks, CUDA threads(128) blockidx%x threadidx%x
                 Loop parallelized across CUDA thread blocks ! blockidx%y
         10, Generating Multicore code
             10, Loop parallelized across CPU threads

Yep, works on my system too. Thanks!

BTW, any timeline on the next update for Do Concurrent?

I’ve got another bug for you. I increased the rank of the arrays by 1. If the array size becomes too large, the code quits with the message “Killed”. The code will run with l=10.

program test

  implicit none

  integer :: i, j, k
  integer, parameter :: m=10000, n=10000, l=100
  !real :: a(m,n),b(m,n),c(m,n)
  real, allocatable, dimension(:,:,:) :: a, b, c
  real :: start, finish

  do concurrent (i=1:n, j=1:m, k=1:l)
  end do

end program test

nvfortran -stdpar=gpu,multicore -Minfo test.f90 -o test

Mea culpa. It looks like this is a memory limit that has nothing to do with NVFORTRAN or the DO CONCURRENT loop; it appears to be a Fortran-related memory limitation.

In terms of the next update: there are bug fixes in every release. However, your original code won’t work unless you have a truly unified memory system or an IBM POWER system with ATS enabled. You allocated the arrays on the host stack, where they cannot leverage CUDA managed memory. I hope that is clear.
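To illustrate the distinction being made here (a sketch, not code from this thread): arrays declared with fixed bounds inside a program live on the host stack and are not reachable by the GPU under -stdpar=gpu, whereas allocatable arrays go on the heap, where the compiler can place them in CUDA managed memory:

```fortran
! Sketch: why the allocatable version works under -stdpar=gpu
! while the fixed-size stack version does not.
program managed_vs_stack
  implicit none
  integer, parameter :: n = 10000
  ! real :: a(n,n)              ! host stack: not visible to the GPU
  real, allocatable :: a(:,:)   ! heap: eligible for CUDA managed memory
  integer :: i, j

  allocate(a(n,n))

  do concurrent (i=1:n, j=1:n)
     a(i,j) = real(i + j)
  end do

  print *, a(n,n)
end program managed_vs_stack
```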


I have a small problem when attempting to write a matrix to a file using the nvfortran compiler.

I wrote a code that calculates a matrix S(c,c), where c=500 or more.

DO CONCURRENT (a=1:c, b=1:c) local(soma1, soma2, soma3)

   soma1 = some expression
   soma2 = some expression
   soma3 = some expression

END DO


Then the program is compiled:

nvfortran -stdpar=gpu -Minfo=accel corr.f90 -o corr_gpu

Everything was as expected until I tried the following code lines.


DO i = 1, c
   WRITE(321,*) (corr(i,j), j=1,c)
END DO


Instead of an output file with c columns and c rows

value1 value2 value3 value4 value5 value6 value7 value8 … valuec

value1 value2 value3 value4 value5 value6 value7 value8 … valuec

value1 value2 value3 value4 value5 value6 value7 value8 … valuec

value1 value2 value3 value4 value5 value6 value7 value8 … valuec


I get an output file with 4 columns.

value1 value2 value3 value4

value5 value6 value7 value8


valuec value1 value2

Using the gfortran compiler (gfortran corr.f90 -o corr_linear), the output file is a c × c matrix.

Does anyone have any idea why this difference occurs?

Solution: Write output data file with more than 3 columns
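For context: with list-directed output (the `*` format), where records are broken is largely processor dependent, so different compilers may legitimately wrap the line at different points. An explicit format with an unlimited repeat count forces one matrix row per record on any compiler. A sketch (the field width es16.8 is an arbitrary choice, not from the thread):

```fortran
! Sketch: an explicit format makes the record layout compiler-independent,
! writing all c values of a row into a single record.
DO i = 1, c
   WRITE(321, '(*(es16.8,1x))') (corr(i,j), j = 1, c)
END DO
```

The unlimited format item `*( ... )` is standard Fortran 2008 and is accepted by both nvfortran and gfortran.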

I am beginning my learning in programming NVIDIA GPUs, and just read this very instructive and useful article. I would like to use DO CONCURRENT and the HPC SDK to accelerate several of my Fortran programs, which need to run on Windows, using GPUs. I saw that the HPC SDK is not available yet for Windows. I would like to ask the authors @gozen or @grahamlopez if there is any estimate of when this Windows version will become available.

I also saw on another post (WSL and PGI compiler works great!) that one can enable the Windows Subsystem for Linux (WSL) on a Windows 10 machine, install the HPC SDK in WSL, compile Fortran code with the NVFORTRAN compiler, and then (apparently) invoke the resulting executable from Windows. This may provide a solution to accelerate my Fortran code under Windows until a HPC SDK version for Windows is released, but only for Windows 10. Is there any way to run the resulting Linux executable under previous Windows versions, in which WSL is not available?

Hello! We are currently working on bringing the HPC SDK and the HPC Compilers to Windows; we hope to make an announcement about this later this year. As to your second question, I do not know of a way to run WSL executables in older versions of Windows that do not support WSL.