Hello,
I’m checking the CUDA Fortran port of my legacy Fortran code for thread independence. To do this, I run the code in parallel on the GPU (using grid/block arguments of *,*) and compare the results to a serial GPU run (using grid/block arguments of 1,1). For reasons I don’t understand, the results for the following loop diverge between the parallel and serial runs. If anybody can spot an issue in the code below that I’m missing, please let me know. Any alternative suggestions would also be appreciated. Thanks!
!$cuf kernel do(3) <<< 1,1 >>>
DO K=2,KBM1
  DO J=2,JMM1
    DO I=2,IMM1
      Q2B_d(I,J,K)=ABS(Q2B_d(I,J,K))
      Q2LB_d(I,J,K)=ABS(Q2LB_d(I,J,K))
      BOYGR_d(I,J,K)=GEE*(RHO_d(I,J,K-1)-RHO_d(I,J,K))/(DZZ_d(K-1)*DHF_d(I,J))
      KN_d(I,J,K)=(KM_d(I,J,K)*.25*SEF &
        *( (U_d(I,J,K)-U_d(I,J,K-1)+U_d(I+1,J,K)-U_d(I+1,J,K-1))**2 &
          +(V_d(I,J,K)-V_d(I,J,K-1)+V_d(I,J+1,K)-V_d(I,J+1,K-1))**2 ) &
        /(DZZ_d(K-1)*DHF_d(I,J))**2) &
        + KH_d(I,J,K)*BOYGR_d(I,J,K)
      BOYGR_d(I,J,K)=Q2B_d(I,J,K)*SQRT(Q2B_d(I,J,K))/(B1*Q2LB_d(I,J,K)+SMALL)
    ENDDO
  ENDDO
ENDDO
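
For context, the parallel-versus-serial check described above can be sketched roughly as follows. This is only a minimal illustration, not the actual test harness: it assumes the device arrays from each run have already been copied back to host arrays, and the names compare_mod, max_abs_diff, kn_par and kn_ser are hypothetical. It reports the largest absolute difference and the (I,J,K) position where it occurs, which helps show whether the mismatch sits at a boundary or in the interior.

module compare_mod
  implicit none
contains
  ! Return the largest absolute difference between two same-shaped 3D
  ! arrays and print the index where it occurs (1-based within the
  ! arrays as passed in). Assumes neither array contains NaNs.
  function max_abs_diff(a, b, name) result(dmax)
    real, intent(in) :: a(:,:,:), b(:,:,:)
    character(len=*), intent(in) :: name
    real :: dmax
    integer :: loc(3)
    loc  = maxloc(abs(a - b))
    dmax = abs(a(loc(1),loc(2),loc(3)) - b(loc(1),loc(2),loc(3)))
    print '(a,a,es12.4,a,3i6)', name, ': max |diff| =', dmax, ' at', loc
  end function max_abs_diff
end module compare_mod

program demo
  use compare_mod
  implicit none
  ! Hypothetical host copies of KN_d from the parallel and serial runs
  real :: kn_par(4,4,3), kn_ser(4,4,3), d
  kn_ser = 1.0
  kn_par = 1.0
  kn_par(2,3,2) = 1.5          ! inject a fake mismatch for the demo
  d = max_abs_diff(kn_par, kn_ser, 'KN')
end program demo

Since the loop has no reductions, I would expect the two runs to agree exactly if the iterations are truly independent, so any nonzero difference reported here is worth tracing back to its index.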