Hi,
The following code causes a segfault with PGI 17.7 and "-ta=tesla:cc50,cuda8.0,managed,deepcopy":
!$acc kernels
      do k=1,np
        v%r(:,1,k)=sum0(:)-v%r(:,2,k)
        v%t(:,1,k)=sumc0(:)*sph(k)-sums0(:)*cph(k)
      enddo
      do k=1,npm1
        v%p(:,1,k)= two*( sums0(:)*sp(k)+sumc0(:)*cp(k) )
     &              -v%p(:,2,k)
      enddo
!$acc end kernels
The first dimensions of v%r and v%t differ by 1.
I eventually got this to work, but only by writing the loops out explicitly and preloading the arrays onto the device (hence the present clause), as follows:
!$acc parallel present(sph,cph,sp,cp,v,sums0,sum0,sumc0)
!$acc loop gang worker
      do k=1,np
!$acc loop vector
        do i=1,nrm
          v%r(i,1,k)=sum0(i)-v%r(i,2,k)
        enddo
!$acc loop vector
        do i=1,nr
          v%t(i,1,k)=sumc0(i)*sph(k)-sums0(i)*cph(k)
        enddo
      enddo
!$acc loop gang worker
      do k=1,npm1
!$acc loop vector
        do i=1,nr
          v%p(i,1,k)= two*( sums0(i)*sp(k)+sumc0(i)*cp(k) )
     &                -v%p(i,2,k)
        enddo
      enddo
!$acc end parallel
I would prefer not to change the original compact code.
Do you know which part of the first version is causing the problem? Is this an inherently "bad" loop for kernels, or is it a compiler compatibility issue?
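In case it helps with reproducing the problem, here is a minimal stand-alone sketch of the pattern in the first version. The type name (vvec), the extents (nr=5, nrm=nr-1, np=4, npm1=np-1), the shape of v%p, and all initial values are assumptions made for illustration only and are not taken from the actual application. It is written in free-form Fortran for brevity, and it has not been confirmed to trigger the same segfault.

```fortran
! Hypothetical stand-alone sketch of the array-syntax kernels region.
! All declarations, extents and values below are assumed, not original.
program kernels_sketch
  implicit none
  integer, parameter :: nr=5, nrm=nr-1, np=4, npm1=np-1
  real(8), parameter :: two=2.0d0

  ! Derived type with allocatable members (the case deepcopy targets).
  type :: vvec
    real(8), allocatable :: r(:,:,:), t(:,:,:), p(:,:,:)
  end type vvec

  type(vvec) :: v
  real(8) :: sum0(nrm), sumc0(nr), sums0(nr)
  real(8) :: sph(np), cph(np), sp(npm1), cp(npm1)
  integer :: k

  ! First extent of v%r is nrm; first extent of v%t and v%p is nr.
  allocate(v%r(nrm,2,np), v%t(nr,2,np), v%p(nr,2,npm1))
  v%r=1.0d0; v%t=0.0d0; v%p=1.0d0
  sum0=3.0d0; sumc0=2.0d0; sums0=1.0d0
  sph=0.5d0; cph=0.25d0; sp=0.5d0; cp=0.25d0

  ! Same structure as the failing region: two k loops whose bodies
  ! use whole-section array syntax with different first extents.
  !$acc kernels
  do k=1,np
    v%r(:,1,k)=sum0(:)-v%r(:,2,k)
    v%t(:,1,k)=sumc0(:)*sph(k)-sums0(:)*cph(k)
  enddo
  do k=1,npm1
    v%p(:,1,k)=two*(sums0(:)*sp(k)+sumc0(:)*cp(k))-v%p(:,2,k)
  enddo
  !$acc end kernels

  ! Print one element of each array as a quick sanity check.
  print *, v%r(1,1,1), v%t(1,1,1), v%p(1,1,1)
end program kernels_sketch
```

Built without OpenACC enabled, the directives are treated as comments, so the same file can serve as a host-only reference to compare against the accelerated run.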