OpenACC on GPU and ISO Fortran on multicore

Hi,

To avoid the communication overhead between CPU and GPU memories, I am thinking of writing portions of my code in ISO Fortran (DO CONCURRENT) to be run on the multicore CPU. Other portions that can take advantage of massive parallelism will be written in OpenACC and run on the GPU.
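
For context, the GPU portions are plain OpenACC loops, roughly along these lines (just a simplified sketch with placeholder names, not my actual code):

  ! simplified sketch of a GPU portion (placeholder names, not the actual code)
  subroutine update_field(nglob, field, rhs, deltat)
    implicit none
    integer, intent(in) :: nglob
    real, intent(in)    :: deltat
    real, intent(inout) :: field(nglob)
    real, intent(in)    :: rhs(nglob)
    integer :: iglob

    !$acc parallel loop
    do iglob = 1, nglob
      field(iglob) = field(iglob) + deltat * rhs(iglob)
    enddo
  end subroutine update_field
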
To achieve this, I use the following compiler flags:

-acc=gpu -gpu=managed -stdpar=multicore -Minfo=accel

On compiling the code

  DO CONCURRENT (ispec=1:nspec) local(k,j,i,iglob,weight,jacobianl)
    do k=1,NGLLZ
      do j=1,NGLLY
        do i=1,NGLLX
          iglob = ibool(i,j,k,ispec)

          weight = wxgll(i)*wygll(j)*wzgll(k)
          jacobianl = jacobianstore(i,j,k,ispec)

          rmass_acoustic(iglob) = rmass_acoustic(iglob) + jacobianl * weight / kappastore(i,j,k,ispec)
        enddo
      enddo
    enddo
  enddo

the following is the output of -Minfo=accel:

  15507, Generating implicit copyin(jacobianstore(1:5,1:5,1:5,1:nspec),kappastore(1:5,1:5,1:5,1:nspec)) [if not already present]
         Generating implicit copy(rmass_acoustic(:)) [if not already present]
         Generating implicit copyin(wxgll(1:5),wzgll(1:5),wygll(1:5),ibool(:,:,:,:nspec)) [if not already present]
         Generating Multicore code
      15507, Loop parallelized across CPU threads
  15508, Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
  15509, Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
  15510, Complex loop carried dependence of rmass_acoustic prevents parallelization
         Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
         Inner sequential loop scheduled on accelerator
         Generating NVIDIA GPU code
      15507, Loop parallelized across CUDA thread blocks, CUDA threads(128) blockidx%x threadidx%x
      15508, Loop run sequentially 
      15509, Loop run sequentially 
      15510, Loop run sequentially 
  15510, Complex loop carried dependence of rmass_acoustic prevents parallelization

What I do not understand is why it is also generating GPU code for the DO CONCURRENT loop, even though I asked for -stdpar=multicore. Any thoughts?

Cheers,
Jyoti

Hi Jyoti,

Interesting use case that I haven’t seen before, and one our engineers likely haven’t considered either. Given that our DO CONCURRENT support is built on top of OpenACC, adding “-acc=gpu” causes the DO CONCURRENT loops to be offloaded as well. The actual binary will contain both multicore and GPU versions, but at run time both models will either offload to the GPU or run on the multicore CPU.

One workaround would be to call “acc_set_device_type(acc_device_host)” before the DO CONCURRENT loops to have them use the multicore version, and “acc_set_device_type(acc_device_nvidia)” before the OpenACC loops to run them on the GPU. Not ideal, but it should work.
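
Something along these lines (just a rough sketch with made-up names; I’m assuming the runtime API is available via “use openacc”):

  ! rough sketch with made-up names; the point is the device-type switching
  subroutine mixed_example(n, a, b)
    use openacc
    implicit none
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real, intent(in)    :: b(n)
    integer :: i

    ! force the DO CONCURRENT loop to use the host (multicore) version
    call acc_set_device_type(acc_device_host)
    do concurrent (i=1:n)
      a(i) = a(i) + b(i)
    enddo

    ! switch back so the OpenACC region offloads to the GPU
    call acc_set_device_type(acc_device_nvidia)
    !$acc parallel loop
    do i = 1, n
      a(i) = 2.0*a(i)
    enddo
  end subroutine mixed_example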

Alternatively, instead of DO CONCURRENT you could use OpenMP host parallelization to do the same thing.
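
For example, your loop above would look roughly like this with OpenMP (compiled with -mp); the atomic is there to guard the scatter-add in case multiple elements share an iglob:

  ! sketch of the same loop using OpenMP host parallelization
  !$omp parallel do private(k,j,i,iglob,weight,jacobianl)
  do ispec = 1, nspec
    do k=1,NGLLZ
      do j=1,NGLLY
        do i=1,NGLLX
          iglob = ibool(i,j,k,ispec)
          weight = wxgll(i)*wygll(j)*wzgll(k)
          jacobianl = jacobianstore(i,j,k,ispec)
          !$omp atomic update
          rmass_acoustic(iglob) = rmass_acoustic(iglob) + jacobianl * weight / kappastore(i,j,k,ispec)
        enddo
      enddo
    enddo
  enddo
  !$omp end parallel do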

I added a report, TPR #34318, and will ask engineering to take a look.

Thanks,
Mat

Hi Mat,

I will give the “acc_set_device_type” approach a try.

I did think of using OpenMP for multicore parallelization, but given the multiple advantages of standard Fortran, I am leaning towards DO CONCURRENT.

As always, I very much appreciate your input.

Cheers,
Jyoti
