OpenACC on GPU and ISO Fortran on multicore

Hi,

To avoid the communication overhead between CPU and GPU memories, I am thinking of writing portions of my code in ISO Fortran (DO CONCURRENT) to be run on the multicore CPU. Other portions that can take advantage of massive parallelism will be written in OpenACC and run on the GPU.
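
For context, the GPU portions are plain OpenACC loops, roughly along these lines (just a simplified sketch with placeholder names, not my actual code):

  ! simplified sketch of a GPU portion (placeholder names, not the actual code)
  subroutine update_field(nglob, field, rhs, deltat)
    implicit none
    integer, intent(in) :: nglob
    real, intent(in)    :: deltat
    real, intent(inout) :: field(nglob)
    real, intent(in)    :: rhs(nglob)
    integer :: iglob

    !$acc parallel loop
    do iglob = 1, nglob
      field(iglob) = field(iglob) + deltat * rhs(iglob)
    enddo
  end subroutine update_field
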
To achieve this, I use the following compiler flags:

-acc=gpu -gpu=managed -stdpar=multicore -Minfo=accel

On compiling the code

  DO CONCURRENT (ispec=1:nspec) local(k,j,i,iglob,weight,jacobianl)
    do k=1,NGLLZ
      do j=1,NGLLY
        do i=1,NGLLX
          iglob = ibool(i,j,k,ispec)

          weight = wxgll(i)*wygll(j)*wzgll(k)
          jacobianl = jacobianstore(i,j,k,ispec)

          rmass_acoustic(iglob) = rmass_acoustic(iglob) + jacobianl * weight / kappastore(i,j,k,ispec)
        enddo
      enddo
    enddo
  enddo

the following is the output of -Minfo=accel:

  15507, Generating implicit copyin(jacobianstore(1:5,1:5,1:5,1:nspec),kappastore(1:5,1:5,1:5,1:nspec)) [if not already present]
         Generating implicit copy(rmass_acoustic(:)) [if not already present]
         Generating implicit copyin(wxgll(1:5),wzgll(1:5),wygll(1:5),ibool(:,:,:,:nspec)) [if not already present]
         Generating Multicore code
      15507, Loop parallelized across CPU threads
  15508, Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
  15509, Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
  15510, Complex loop carried dependence of rmass_acoustic prevents parallelization
         Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
         Inner sequential loop scheduled on accelerator
         Generating NVIDIA GPU code
      15507, Loop parallelized across CUDA thread blocks, CUDA threads(128) blockidx%x threadidx%x
      15508, Loop run sequentially 
      15509, Loop run sequentially 
      15510, Loop run sequentially 
  15510, Complex loop carried dependence of rmass_acoustic prevents parallelization

What I do not understand is why it is also generating GPU code for the DO CONCURRENT loop, even though I asked for -stdpar=multicore. Any thoughts?

Cheers,
Jyoti

Hi Jyoti,

Interesting use case that I haven’t seen before, and one our engineers likely haven’t considered either. Given that our DO CONCURRENT support is built on top of OpenACC, adding “-acc=gpu” causes the DO CONCURRENT loops to be offloaded as well. The actual binary will contain both multicore and GPU versions, but at run time both models will either offload to the GPU or run on the multicore CPU.

One workaround would be to call “acc_set_device_type(acc_device_host)” before the DO CONCURRENT loops to have them use the multicore version, and “acc_set_device_type(acc_device_nvidia)” before the OpenACC loops to run them on the GPU. Not ideal, but it should work.
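
Something along these lines (just a rough sketch with made-up names; I’m assuming the runtime API is available via “use openacc”):

  ! rough sketch with made-up names; the point is the device-type switching
  subroutine mixed_example(n, a, b)
    use openacc
    implicit none
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real, intent(in)    :: b(n)
    integer :: i

    ! force the DO CONCURRENT loop to use the host (multicore) version
    call acc_set_device_type(acc_device_host)
    do concurrent (i=1:n)
      a(i) = a(i) + b(i)
    enddo

    ! switch back so the OpenACC region offloads to the GPU
    call acc_set_device_type(acc_device_nvidia)
    !$acc parallel loop
    do i = 1, n
      a(i) = 2.0*a(i)
    enddo
  end subroutine mixed_example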

Alternatively, instead of DO CONCURRENT you could use OpenMP host parallelization to do the same thing.
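
For example, your loop above would look roughly like this with OpenMP (compiled with -mp); the atomic is there to guard the scatter-add in case multiple elements share an iglob:

  ! sketch of the same loop using OpenMP host parallelization
  !$omp parallel do private(k,j,i,iglob,weight,jacobianl)
  do ispec = 1, nspec
    do k=1,NGLLZ
      do j=1,NGLLY
        do i=1,NGLLX
          iglob = ibool(i,j,k,ispec)
          weight = wxgll(i)*wygll(j)*wzgll(k)
          jacobianl = jacobianstore(i,j,k,ispec)
          !$omp atomic update
          rmass_acoustic(iglob) = rmass_acoustic(iglob) + jacobianl * weight / kappastore(i,j,k,ispec)
        enddo
      enddo
    enddo
  enddo
  !$omp end parallel do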

I added a report, TPR #34318, and will ask engineering to take a look.

Thanks,
Mat

Hi Mat,

I will give the “acc_set_device_type” approach a try.

I did think of using OpenMP for multicore parallelization, but given the multiple advantages of standard Fortran, I am leaning towards DO CONCURRENT.

As always, I very much appreciate your input.

Cheers,
Jyoti
