Hello everybody,
I have a procedure that I want to run partly on the GPU. I’ve already successfully ported some loops that copy essential arrays.
One loop, however, takes approximately 25 times longer on the GPU than on the CPU. The CPU code looks like this:
      do kk=1,igfy
        hmatrix(kk,igfy)=zero
        do nbl=1,nblcks
          do n=ijklim(nbl,1),nijkpr(nbl)
            ijk=ijkpr(n)
            hmatrix(kk,igfy)=hmatrix(kk,igfy)+
     &        vvect(ijk,igfyp1)*vvect(ijk,kk)
          enddo
        enddo
        do nbl=1,nblcks
          do n=ijklim(nbl,1),nijkpr(nbl)
            ijk=ijkpr(n)
            vvect(ijk,igfyp1)=vvect(ijk,igfyp1)
     &        -hmatrix(kk,igfy)*vvect(ijk,kk)
          enddo
        enddo
      enddo
To make it run on the GPU, I changed the code like this:
(630) !$acc region local(kk)
(631)       do kk=1,igfy
(632)         hmatrix(kk,igfy)=zero
(633)       enddo
(634)
(635)       do nbl=1,nblcks
(636)         do kk=1,igfy
(637)           do ijk=imoj4,imoj5
(638)             hmatrix(kk,igfy)=hmatrix(kk,igfy)+
(639)      &        vvect(ijk,igfyp1)*vvect(ijk,kk)
(640)           enddo
(641)         enddo
(642)       enddo
(643)
(644)       do nbl=1,nblcks
(645)         do ijk=imoj4,imoj5
(646)           do kk=1,igfy
(647)             vvect(ijk,igfyp1)=vvect(ijk,igfyp1)
(648)      &        -hmatrix(kk,igfy)*vvect(ijk,kk)
(649)           enddo
(650)         enddo
(651)       enddo
(652) !$acc end region
Profile output:
630: region entered 3990 times
    time(us): total=74000000
              kernels=36032074 data=49510
    631: kernel launched 3990 times
        grid: [1] block: [256]
        time(us): total=28628 max=140 min=4 avg=7
    637: kernel launched 3990 times
        grid: [1] block: [32]
        time(us): total=35830347 max=11623 min=6040 avg=8980
    646: kernel launched 3990 times
        grid: [72] block: [256]
        time(us): total=173099 max=131 min=15 avg=43
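To put these numbers in proportion, here is a small stand-alone program that computes the ratios between the kernel totals. It is plain Fortran with no accelerator directives; the program and variable names are made up for illustration, and the values are copied from the output above.

```fortran
program kernel_ratios
  implicit none
  ! Kernel totals in microseconds, copied from the profile output.
  real(kind=8), parameter :: t631 = 28628.0d0        ! kernel at line 631
  real(kind=8), parameter :: t637 = 35830347.0d0     ! kernel at line 637
  real(kind=8), parameter :: t646 = 173099.0d0       ! kernel at line 646
  real(kind=8), parameter :: t_kernels = 36032074.0d0 ! "kernels=" total

  ! How much slower the 637 kernel is than the other two kernels.
  print '(a,f8.1)', 'kernel 637 / kernel 646 : ', t637 / t646
  print '(a,f8.1)', 'kernel 637 / kernel 631 : ', t637 / t631

  ! Share of the total kernel time spent in the 637 kernel.
  print '(a,f8.1,a)', 'kernel 637 share        : ', &
        100.0d0 * t637 / t_kernels, ' %'
end program kernel_ratios
```

By these totals the kernel at line 637 accounts for nearly all of the kernel time, and it is also the only one launched with a single block of 32 threads.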
I also tried swapping lines 636 and 637:
      do ijk=imoj4,imoj5
        do kk=1,igfy
but the result was the same.
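Since the 637 kernel is launched with grid [1] and block [32], one variant that may be worth testing is to accumulate each dot product in a scalar temporary instead of updating hmatrix(kk,igfy) inside the inner loop, which might give the compiler a better chance to treat the inner loop as a sum reduction. The following is only a sketch under that assumption, not a confirmed fix: the array sizes, the test data and the temporary `s` are hypothetical, and the nbl loop is left out on the assumption that nblcks is 1 (the ported loop body does not depend on nbl). The `!$acc region` lines are the same directives as in the code above; a compiler without support for them treats them as comments.

```fortran
program dot_reduction_sketch
  implicit none
  ! Hypothetical sizes and bounds, chosen only to make the sketch runnable.
  integer, parameter :: npts = 1000, igfy = 4, igfyp1 = igfy + 1
  integer, parameter :: imoj4 = 1, imoj5 = npts
  real(kind=8), parameter :: zero = 0.0d0
  real(kind=8) :: vvect(npts, igfyp1), hmatrix(igfyp1, igfy), s
  integer :: ijk, kk

  ! Made-up test data: every entry of column kk equals kk.
  do kk = 1, igfyp1
    vvect(:, kk) = real(kk, kind=8)
  end do
  hmatrix = zero

!$acc region
  do kk = 1, igfy
    ! Scalar accumulator (assumption: easier for the compiler to
    ! recognize as a reduction than an array element).
    s = zero
    do ijk = imoj4, imoj5
      s = s + vvect(ijk, igfyp1) * vvect(ijk, kk)
    end do
    hmatrix(kk, igfy) = s
  end do
!$acc end region

  ! With the data above, hmatrix(kk,igfy) = npts * igfyp1 * kk.
  print '(4f12.1)', hmatrix(1:igfy, igfy)
end program dot_reduction_sketch
```

Whether this changes the grid and block sizes chosen for the kernel would have to be checked in the compiler feedback and in a new profile.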
Why does the loop on line 637 take ~200 times longer than the loop on line 646? Any ideas?
Thanks!