Hello everyone, I have encountered an issue. The same Fortran code runs on the GPU when I run it standalone, but when I place it into a larger project, it automatically runs on the CPU instead of the GPU. The code is exactly the same. Why does it default to CPU execution when it is part of a larger project?
This is the code.
!$acc data copyin(BT_11, BT_22, BT_33) copyout(result1, result2)
!$acc parallel loop gang vector collapse(2) private(sub1, sub2, sub3, istart2, jstart2, max_cc1, max_cc2) firstprivate(sbox_width,bbox_width)
DO iline = 50, yn_amv-50   ! 229
   DO ielem = 50, xn_amv-50   ! 229
      ! Reset the running maxima for this (ielem, iline) point
      max_cc1 = -999
      max_cc2 = -999
      istart2 = (ielem-1) * sbox_width/2 + 1
      jstart2 = (iline-1) * sbox_width/2 + 1
      ! Window of BT_22 anchored at this point
      sub2 = BT_22(istart2:istart2+sbox_width-1, jstart2:jstart2+sbox_width-1)
      ! Try every offset of the comparison window in BT_11 and BT_33
      do x = 1, bbox_width - sbox_width + 1
         do y = 1, bbox_width - sbox_width + 1
            istart1 = max(1, istart2 - (bbox_width-sbox_width)/2 + x - 1)
            jstart1 = max(1, jstart2 - (bbox_width-sbox_width)/2 + y - 1)
            ! Only compare windows that stay within index 2748
            if (istart1 + sbox_width - 1 <= 2748 .and. jstart1 + sbox_width - 1 <= 2748) then
               sub1 = BT_11(istart1:istart1+sbox_width-1, jstart1:jstart1+sbox_width-1)
               sub3 = BT_33(istart1:istart1+sbox_width-1, jstart1:jstart1+sbox_width-1)
               correlation1 = get_matrix_correlation_coef_f(sbox_width, sbox_width, sub2, sub1)
               correlation2 = get_matrix_correlation_coef_f(sbox_width, sbox_width, sub2, sub3)
               if (max_cc1 < correlation1) then
                  max_cc1 = correlation1
               end if
               if (max_cc2 < correlation2) then
                  max_cc2 = correlation2
               end if
            end if
         end do
      end do
      result1(ielem, iline) = max_cc1
      result2(ielem, iline) = max_cc2
      print*,'xx',ielem, iline,max_cc1,max_cc2
   END DO
END DO
!$acc end parallel
!$acc end data
This is the compiler output:
    412, Generating copyin(bt_11(:,:),bt_22(:,:),bt_33(:,:)) [if not already present]
         Generating copyout(result2(:,:),result1(:,:)) [if not already present]
    413, Generating implicit firstprivate(yn_amv,xn_amv)
         Generating NVIDIA GPU code
        414, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        415,   ! blockidx%x threadidx%x collapsed
        421, !$acc loop seq
        422, !$acc loop seq
             Generating implicit reduction(max:max_cc2,max_cc1)
        423, !$acc loop seq
             Generating implicit reduction(max:max_cc2,max_cc1)
        429, !$acc loop seq
    413, Local memory used for sub1,sub3,sub2
    415, Generating implicit firstprivate(x)
    421, Loop is parallelizable
    422, Loop carried reuse of sub1 prevents parallelization
         Complex loop carried dependence of sub1 prevents parallelization
         Loop carried reuse of sub3 prevents parallelization
         Complex loop carried dependence of sub3,sub1 prevents parallelization
         Generating implicit firstprivate(y)
    423, Loop carried reuse of sub1,sub3 prevents parallelization
         Generating implicit firstprivate(correlation2,jstart1,istart1,correlation1)
         Loop carried reuse of sub3 prevents parallelization
    429, Loop is parallelizable
get_matrix_correlation_coef_f:
    537, Generating implicit acc routine seq
         Generating acc routine seq
         Generating NVIDIA GPU code
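
The log above shows that GPU code was generated for the region, so one way to narrow the problem down is to check at run time where a compute region actually executes inside the larger project's executable. The following is only a minimal sketch, not part of the original code: the program name check_acc_device and the variable names are made up for illustration, and it assumes the file is built with the same OpenACC compile and link flags as the project (with the NVIDIA compiler, something like nvfortran -acc -Minfo=accel). It uses acc_get_num_devices and acc_on_device from the standard openacc module.

```fortran
program check_acc_device
  use openacc            ! OpenACC runtime API (needs an OpenACC-enabled compile)
  implicit none
  integer :: ndev
  logical :: ran_on_host

  ! How many NVIDIA GPUs the OpenACC runtime can see from this executable.
  ndev = acc_get_num_devices(acc_device_nvidia)
  print *, 'NVIDIA devices visible to the runtime:', ndev

  ! Ask from inside a compute region whether we are executing on the host.
  ran_on_host = .true.
  !$acc serial copy(ran_on_host)
  ran_on_host = acc_on_device(acc_device_host)
  !$acc end serial

  if (ran_on_host) then
     print *, 'The compute region executed on the host (CPU).'
  else
     print *, 'The compute region executed on an accelerator.'
  end if
end program check_acc_device
```

If this reports the host when built through the larger project's build system but an accelerator when built standalone, the difference is likely in how the project compiles or links (for example, OpenACC flags missing from the link step) rather than in the loop itself. With the NVIDIA HPC SDK, setting the environment variable NVCOMPILER_ACC_NOTIFY=1 before running should also print a line for each kernel launched on the GPU, which gives a second way to confirm this.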