Hello,
Hope you are having a wonderful week.
Could you kindly help me understand why my -Minfo output changes between compiling the code on its own and compiling the same code as a subroutine of a bigger program? The code uses OpenACC directives like this:
ALLOCATE( multtwobdmat(3 * n, 3 * nbeams))
!$acc data copyin(twobdmat,kbeam) copyout(multtwobdmat)
!$acc parallel loop
Do i = 1 , 3 * n
   Do j = 1 , 3 * nbeams
!$acc loop
      Do ii = 1 , 3 * nbeams
         multtwobdmat(i,j) = multtwobdmat(i,j) + twobdmat(i,ii) * kbeam(ii,j)
      end do
   end do
end do
!$acc end data
ALLOCATE( multtwothreebdmat(3 * ncol, 3 * nbeams))
!$acc data copyin(twothreebdmat,kbeam) copyout(multtwothreebdmat)
!$acc parallel loop
Do i = 1 , 3 * ncol
   Do j = 1 , 3 * nbeams
!$acc loop
      Do ii = 1 , 3 * nbeams
         multtwothreebdmat(i,j) = multtwothreebdmat(i,j) + (twothreebdmat(i,ii)) * (kbeam(ii,j))
      end do
   end do
end do
!$acc end data
When I compiled the code on its own with the following command,
pgfortran -o GENA213.exe GENA213.cuf -fast -Minfo=opt -ta:tesla:cc50 -Minfo=accel -lcula_lapack_pgfortran
the compiler reported the following:
....
129, Zero trip check eliminated
138, Generating copyout(multtwobdmat(:,:))
     Generating copyin(twobdmat(:,:),kbeam(:,:))
140, Accelerator kernel generated
     Generating Tesla code
    141, !$acc loop gang ! blockidx%x
    142, !$acc loop seq
    144, !$acc loop vector(128) ! threadidx%x
142, Loop is parallelizable
144, Loop is parallelizable
153, Generating copyin(twothreebdmat(:,:))
     Generating copyout(multtwothreebdmat(:,:))
     Generating copyin(kbeam(:,:))
155, Accelerator kernel generated
     Generating Tesla code
    156, !$acc loop gang ! blockidx%x
    157, !$acc loop seq
    159, !$acc loop vector(128) ! threadidx%x
157, Loop is parallelizable
159, Loop is parallelizable
....
Now that I have placed it in a subroutine, I compile it with the following command:
pgfortran -Mcuda -Mlarge_arrays -o PL.exe PL.for -fast -ta:tesla:cc50 -acc -Minfo=all -lcula_lapack_pgfortran
I am getting this:
...
7910, Loop not fused: function call before adjacent loop
      Generated an alternate version of the loop
      Generated vector simd code for the loop
      Generated 2 prefetch instructions for the loop
      Generated vector simd code for the loop
      Generated 2 prefetch instructions for the loop
      FMA (fused multiply-add) instruction(s) generated
7921, Loop not fused: function call before adjacent loop
7924, Zero trip check eliminated
      Generated an alternate version of the loop
      Generated vector simd code for the loop containing reductions
      Generated a prefetch instruction for the loop
      Generated vector simd code for the loop containing reductions
      Generated a prefetch instruction for the loop
      FMA (fused multiply-add) instruction(s) generated
...
I have not changed anything in the code, except that the first file is “.cuf” and the second is “.for”. Could that be the difference? Did the compiler recognize the directives and simply not show me the accelerator messages?
Could you also help me understand the messages produced the second time, such as “Loop not fused: function call before adjacent loop” and “Loop interchange produces reordered loop nest: 7938,7940,7937”?
I cannot find any source online that explains these messages.
Thank you for your time.
Ahmed