Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK

Originally published at:

Fortran developers have long been able to accelerate their programs using CUDA Fortran or OpenACC. Now with the latest 20.11 release of the NVIDIA HPC SDK, the included NVFORTRAN compiler automatically accelerates DO CONCURRENT, allowing you to get the benefit of the full power of NVIDIA GPUs using ISO Standard Fortran without any extensions, directives,…
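For readers new to stdpar, here is a minimal sketch (mine, not from the blog post) of what the feature looks like: a SAXPY-style loop written in pure ISO Standard Fortran, which NVFORTRAN can offload to the GPU when compiled with `nvfortran -stdpar=gpu saxpy.f90`.

```fortran
! Minimal stdpar sketch: no directives, no CUDA extensions.
program saxpy_stdpar
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  x = 1.0
  y = 2.0

  ! The iterations are declared independent, so the compiler is free to
  ! run them in parallel on the GPU (or across CPU cores with -stdpar=multicore).
  do concurrent (i = 1:n)
    y(i) = y(i) + 2.0 * x(i)
  end do

  print *, 'y(1) =', y(1)
end program saxpy_stdpar
```

Note the arrays are ALLOCATABLE, which matters for GPU offload, as discussed further down this thread.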

I heard about this at SC18. Glad that it has finally become reality! Super useful. Looking forward to the full implementation without the limitations mentioned in the blog post.

When is it scheduled for release?

We appreciate your interest! We are continuing to work on removing limitations, so be sure to keep an eye out for upcoming releases with improvements.


Hi. The DO CONCURRENT implementation will be available in our next release of the NVIDIA HPC SDK, version 20.11.


This was a great article! It appears that all the discussion and examples are based on accelerating Fortran without any need for CUDA programming, but only on a single GPU.

From my work so far on multi-GPU programming, invoking two GPUs and partitioning the data between them always needs some CUDA-related code – for instance, binding an MPI rank or a thread to one of the GPUs, using CUDA streams for simultaneous use of multiple GPUs, or other approaches to enable acceleration on multiple GPUs.

All of these at the very least require selecting the device one way or another, which in turn needs CUDA and hence runs in the opposite direction of “Accelerating Fortran with a GPU Using stdpar”, where the goal is not to change the CPU-based code (no CUDA runtime API, etc.) and simply to compile it with NVFORTRAN.

Perhaps it would be ideal if, in an MPI-based code – consisting of i) simultaneous use of CPUs to solve each sub-domain/array and ii) possible CPU-to-CPU communication – compiling with NVFORTRAN and stdpar automatically offloaded a DO loop inside MPI process #0 to GPU 0, the same loop inside MPI process #1 to GPU 1, and so on. That way, a code base that is already massively parallelized with MPI on CPUs could run on and utilize multi-GPU environments equally well with no change to any of the paradigms. If only this were possible…

**I’m very curious whether there is already any way around this, and if not, is this something to look forward to in the future?** I’d appreciate any insights.

Indeed, multi-GPU programming is an important use case, and we are already looking into ways to make that easier, including with stdpar. I’ll also mention that we are planning to publish another blog post about more advanced usage of DO CONCURRENT, so keep an eye out for that as well.
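For reference, a sketch of the rank-to-GPU binding described above, using CUDA Fortran’s `cudafor` module – so it is not pure ISO Fortran, which is exactly the compromise the question is about. Each MPI rank selects one device; its subsequent DO CONCURRENT loops compiled with `-stdpar=gpu -cuda` then run on that rank’s GPU. This is a hedged example of a common pattern, not an official recommendation.

```fortran
program bind_rank_to_gpu
  use mpi
  use cudafor
  implicit none
  integer :: rank, ndev, ierr, istat

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Round-robin the ranks over the available GPUs on this node.
  istat = cudaGetDeviceCount(ndev)
  istat = cudaSetDevice(mod(rank, ndev))

  ! ... each rank now offloads its own sub-domain's DO CONCURRENT loops ...

  call MPI_Finalize(ierr)
end program bind_rank_to_gpu
```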

Are Tensor Cores used in this way?

I’m running into an error when using the DO CONCURRENT construct that is making me scratch my head.

“NVFORTRAN-F-0000-Internal compiler error. Missing end DO CONCURRENT region block”

Anybody else running into this error? And for the record, yes I do have an END DO at the end of my very simple loop.

Hi. The DO CONCURRENT feature is accelerated using the GPU’s compute SMs. However, the Tensor Cores are engaged when using ISO Fortran array intrinsics, as described in another developer blog, Bringing Tensor Cores to Standard Fortran.

Hi, can you please show me your code and how you compile it?

Here is the test program:

program test
  implicit none
  integer :: i, j
  integer, parameter :: m=10000, n=10000
  real :: a(m,n), b(m,n), c(m,n)

  do concurrent(i=1:n, j=1:m)
  end do
end program test

Here is the compile command:
nvfortran -stdpar=gpu,multicore test.f90 -o test
Here’s the execute command:

This is probably a compiler bug. The following change works:

!real :: a(m,n),b(m,n),c(m,n)
real,allocatable,dimension(:,:) :: a,b,c

nvfortran -stdpar=gpu,multicore forum.f90 -o test -Minfo

         10, Generating Tesla code
             10, Loop parallelized across CUDA thread blocks, CUDA threads(128) blockidx%x threadidx%x
                 Loop parallelized across CUDA thread blocks ! blockidx%y
         10, Generating Multicore code
             10, Loop parallelized across CPU threads

Yep, works on my system too. Thanks!

BTW, any timeline on the next update for DO CONCURRENT?

I’ve got another bug for you. I increased the rank of the arrays by one. If the array size becomes too large, the code quits with the message “Killed”. The code will run with l=10.

program test
  implicit none
  integer :: i, j, k
  integer, parameter :: m=10000, n=10000, l=100
  !real :: a(m,n),b(m,n),c(m,n)
  real, allocatable, dimension(:,:,:) :: a, b, c
  real :: start, finish

  do concurrent(i=1:n, j=1:m, k=1:l)
  end do
end program test

nvfortran -stdpar=gpu,multicore -Minfo test.f90 -o test

Mea culpa. It looks like this is a memory limit that has nothing to do with NVFORTRAN or the DO CONCURRENT loop – just a Fortran-related memory limitation.

In terms of the next update: there are bug fixes in every release. However, your original code won’t work unless you have a truly unified memory system, or an IBM POWER system with ATS enabled. You declared the arrays on the host stack, where they cannot leverage CUDA managed memory. I hope that is clear.
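To illustrate the distinction in my own words (a sketch, not from the blog): under `-stdpar=gpu`, ALLOCATABLE arrays are placed in CUDA managed memory and can migrate to the GPU, while fixed-size local arrays live on the host stack and cannot.

```fortran
program managed_vs_stack
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: s(n)               ! host stack: NOT managed; problematic under -stdpar=gpu
  real, allocatable :: a(:)  ! heap: allocated as managed memory under -stdpar=gpu

  allocate(a(n))

  ! Safe to offload: a(:) is accessible from both CPU and GPU.
  do concurrent (i = 1:n)
    a(i) = real(i)
  end do

  print *, 'sum =', sum(a)
end program managed_vs_stack
```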