Hi,
To avoid the communication overhead between CPU and GPU memories, I am thinking of writing portions of my code in ISO Fortran (DO CONCURRENT) so that they run on the multicore CPU, while other portions that can take advantage of massive parallelism will be written in OpenACC and run on the GPU.
To achieve this, I use the following compiler flags:
-acc=gpu -gpu=managed -stdpar=multicore -Minfo=accel
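For reference, this is roughly the kind of split I have in mind (a toy sketch with placeholder program and array names, not my actual code): one DO CONCURRENT loop that I expect to go to the CPU cores via -stdpar=multicore, and one OpenACC loop that I expect to go to the GPU via -acc=gpu, both compiled together with the flags above (nvfortran -acc=gpu -gpu=managed -stdpar=multicore -Minfo=accel).

program mix_sketch
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: a(:), b(:), c(:)
  integer :: i

  allocate(a(n), b(n), c(n))
  a = 1.0
  b = 2.0

  ! portion I want on the multicore CPU, expressed in ISO Fortran
  do concurrent (i = 1:n)
    c(i) = a(i) + b(i)
  end do

  ! portion I want on the GPU, expressed with OpenACC
  !$acc parallel loop
  do i = 1, n
    c(i) = 2.0 * c(i)
  end do
  !$acc end parallel loop

  print *, c(1), c(n)
  deallocate(a, b, c)
end program mix_sketch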
Here is the actual loop in question:
DO CONCURRENT (ispec=1:nspec) local(k,j,i,iglob,weight,jacobianl)
  do k=1,NGLLZ
    do j=1,NGLLY
      do i=1,NGLLX
        iglob = ibool(i,j,k,ispec)
        weight = wxgll(i)*wygll(j)*wzgll(k)
        jacobianl = jacobianstore(i,j,k,ispec)
        rmass_acoustic(iglob) = rmass_acoustic(iglob) + jacobianl * weight / kappastore(i,j,k,ispec)
      enddo
    enddo
  enddo
enddo
On compiling it, the following is the output of -Minfo=accel:
15507, Generating implicit copyin(jacobianstore(1:5,1:5,1:5,1:nspec),kappastore(1:5,1:5,1:5,1:nspec)) [if not already present]
Generating implicit copy(rmass_acoustic(:)) [if not already present]
Generating implicit copyin(wxgll(1:5),wzgll(1:5),wygll(1:5),ibool(:,:,:,:nspec)) [if not already present]
Generating Multicore code
15507, Loop parallelized across CPU threads
15508, Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
15509, Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
15510, Complex loop carried dependence of rmass_acoustic prevents parallelization
Loop carried dependence due to exposed use of rmass_acoustic(:) prevents parallelization
Inner sequential loop scheduled on accelerator
Generating NVIDIA GPU code
15507, Loop parallelized across CUDA thread blocks, CUDA threads(128) blockidx%x threadidx%x
15508, Loop run sequentially
15509, Loop run sequentially
15510, Loop run sequentially
15510, Complex loop carried dependence of rmass_acoustic prevents parallelization
What I do not understand is why the compiler is generating GPU code for this DO CONCURRENT loop when I asked for -stdpar=multicore. Any thoughts?
Cheers,
Jyoti