Hi,
I am having trouble with my program. My code (in Fortran) is structured as follows:
main function
.
call function1(input arrays)
.
end main
function1(input arrays)
.
!$acc data copy(input and output arrays) , present_or_create(internal arrays)
!$acc kernels
!Then follow about 5 or 6 1D loops
!$acc loop
do i=1.etc
.
.
.
!$acc loop
do i=1.etc
...
!And here is the problem: the first 2D loop
!$acc loop independent gang
do i=1,N
!$acc loop independent gang vector
do j=1,M
independent calculations..
enddo
enddo
!and then again follow 1D loops
!$acc loop
do i=1.etc
.
.
.
!$acc end kernels
!$acc end data
end function1
My problem is that when I compile my code, I get the correct parallelization for all the 1D loops,
for example:
108, Loop is parallelizable
Accelerator kernel generated
108, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
but for the 2D loop the compiler gives me the same 1D schedule instead of a 2D one.
For example, I expect something like:
57, Loop is parallelizable <-- refers to i
59, Loop is parallelizable <-- refers to j
Accelerator kernel generated
57, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
59, !$acc loop gang ! blockidx%y
CC 1.3 : 35 registers; 100 shared, 8 constant, 0 local memory bytes
CC 2.0 : 38 registers; 0 shared, 156 constant, 0 local memory bytes
but I get:
189, Loop is parallelizable <-- refers to i
Accelerator kernel generated
189, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 1.3 : 30 registers; 112 shared, 56 constant, 0 local memory bytes
CC 2.0 : 41 registers; 0 shared, 276 constant, 0 local memory bytes
191, Loop is parallelizable <-- refers to j
And when I finally time my program, I see that this 2D loop is running sequentially rather than in parallel.
Now the strange part: if I apply OpenACC only to the part with the 2D loop,
for example:
function1(input arrays)
.
do i=1.etc
.
.
.
!$acc data copy(input and output arrays) , present_or_create(internal arrays)
!$acc kernels
!$acc loop independent
do i=1,N
!$acc loop independent
do j=1,M
independent calculations..
enddo
enddo
!$acc end kernels
!$acc end data
do i=1.etc
.
.
.
end function1
I get the correct parallelization. I can't figure out what is going on.
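For reference, a minimal self-contained sketch of the first (failing) layout could look like the following. The array names (a, b, tmp), the sizes, and the loop bodies are hypothetical placeholders, not my real code, and I have not confirmed that this reduced version shows the same compiler behaviour. The nested loop uses plain `independent` clauses here; the `gang` and `gang vector` clauses from the first listing can be added back to compare.

```fortran
! Hypothetical reproducer: names, sizes and loop bodies are placeholders.
program repro
  implicit none
  integer, parameter :: nn = 64, mm = 32
  real :: a(nn), b(nn, mm)
  integer :: k

  do k = 1, nn
     a(k) = real(k)
  end do
  b = 0.0

  call function1(a, b, nn, mm)
  print *, 'sum(a) =', sum(a), ' sum(b) =', sum(b)

contains

  subroutine function1(a, b, n, m)
    integer, intent(in)    :: n, m
    real,    intent(inout) :: a(n)
    real,    intent(inout) :: b(n, m)
    real    :: tmp(n)          ! internal work array
    integer :: i, j

    !$acc data copy(a, b) present_or_create(tmp)
    !$acc kernels

    ! 1D loop before the nested loop
    !$acc loop
    do i = 1, n
       tmp(i) = 2.0 * a(i)
    end do

    ! the 2D loop in question
    !$acc loop independent
    do i = 1, n
       !$acc loop independent
       do j = 1, m
          b(i, j) = tmp(i) + real(j)
       end do
    end do

    ! 1D loop after the nested loop
    !$acc loop
    do i = 1, n
       a(i) = a(i) + tmp(i)
    end do

    !$acc end kernels
    !$acc end data
  end subroutine function1

end program repro
```

Assuming the PGI compiler, building this with something like `pgfortran -acc -Minfo=accel repro.f90` should print the per-loop scheduling messages for each of the three loop nests, so the schedule chosen for the 2D loop can be compared with and without the surrounding 1D loops.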
Thanks, Sotiris