OK, I think I figured out what the issue was and got it working.
I had to put the subroutine in a module and then use that module in the main program.
Here is a run comparison with and without “reflected” (matrices A and B get multiplied 10 times):
#With reflected (data movement is around 3.5 seconds)
[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.60642 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.32226 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.33978 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.31944 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.32152 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.34007 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.32129 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.32207 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.33026 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.32959 s
Accelerator Kernel Timing data
reflected.f
mm
46: region entered 10 times
time(us): total=363552644 init=4 region=363552640
kernels=360440121 data=3037902
w/o init: total=363552640 max=36606408 min=36319440 avg=36355264
48: kernel launched 10 times
grid: [625x625] block: [16x16]
time(us): total=165389 max=16550 min=16533 avg=16538
52: kernel launched 10 times
grid: [625x625] block: [16x16]
time(us): total=360274732 max=36028468 min=36025824 avg=36027473
reflected.f
main
23: region entered 1 time
time(us): total=365187413 init=1079992 region=364107421
data=541007
w/o init: total=364107421 max=364107421 min=364107421 avg=364107421
#Without reflected (data movement is around 8.5 seconds)
[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 38.23182 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.88260 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.89074 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.88908 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.87273 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.89082 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.89038 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.89151 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.89142 s
time for C( 10000 , 10000 ) = A( 10000 , 10000
) B( 10000 , 10000 ) is 36.88925 s
Accelerator Kernel Timing data
reflected.f
main
25: region entered 10 times
time(us): total=370220259 init=1085643 region=369134616
kernels=360431990 data=8507312
w/o init: total=369134616 max=37146134 min=36872727 avg=36913461
25: kernel launched 20 times
grid: [625x625] block: [16x16]
time(us): total=360431990 max=36027676 min=16539 avg=18021599
So 3.5 vs. 8.5 seconds is roughly a 58% cut in data movement.
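As a rough check of that figure, the data-movement times can be read from the data= fields in the two profiles above. This assumes the "around 3.5 seconds" is the sum of the data= values reported for mm and main in the reflected run; the symbol names below are only labels for this sketch and do not appear in the profiler output.

```latex
\begin{align*}
t_{\text{reflected}} &= 3037902\,\mu\text{s} + 541007\,\mu\text{s}
                      = 3578909\,\mu\text{s} \approx 3.58\,\text{s} \\
t_{\text{no reflected}} &= 8507312\,\mu\text{s} \approx 8.51\,\text{s} \\
\text{reduction} &= 1 - \frac{t_{\text{reflected}}}{t_{\text{no reflected}}}
                  = 1 - \frac{3578909}{8507312} \approx 0.58
\end{align*}
```

The kernel time is essentially the same in both runs (about 360 s in total), so the saving shows up only in the data= portion.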
Here’s the code in case anyone else is interested in looking at it:
[sindimo@slcb100 working-fortran-example-with-gpu]$ cat reflected.f
      program main
      use myModule
      use accel_lib
      integer dim1, dim2, dim3, seed
      parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
      double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
!populate 2 random matrices
      seed=7654321
      do i = 1, dim1
        do j = 1, dim2
          A(i, j) = ran(seed)
        enddo
      enddo
      do i = 1, dim2
        do j = 1, dim3
          B(i, j) = ran(seed)
        enddo
      enddo
!Multiply the 2 matrices several times (only load them once into the GPU memory)
!$acc data region copyin(A,B)
      do i = 1, 10
        call MM(A,B,C)
      enddo
!$acc end data region
      end program main
      module myModule
      contains
      subroutine MM (X,Y,Z)
      integer dim1, dim2, dim3
      parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
      double precision X(dim1, dim2), Y(dim2, dim3), Z(dim1, dim3)
      real start, finish
!X and Y were already copied to the GPU by the caller's data region
!$acc reflected(X,Y)
      call cpu_time(start)
!$acc region
      do j = 1, dim3
        do i = 1, dim1
          Z(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            Z(i, j) = Z(i, j) + X(i, k)*Y(k, j)
          enddo
        enddo
      enddo
!$acc end region
      call cpu_time(finish)
      print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(',
     1dim2,',',dim3,') is',finish - start,' s'
      end subroutine MM
      end module myModule
I hope others find this useful.
Mohamad Sindi