Thread independence testing

Hello,

I’m verifying a CUDA port of my legacy Fortran code to check for and ensure thread independence. I’ve been doing this by running in parallel on the GPU (using grid/block arguments of *,*) and comparing the results against a serial GPU run (using grid/block arguments of 1,1). For reasons I don’t understand, the results for the following loop diverge between the parallel and serial runs. If anybody can spot an issue in the code below that I’m not seeing, please let me know. Alternative suggestions would also be appreciated. Thanks!

!$cuf kernel do(3) <<< 1,1 >>>
      DO K=2,KBM1
        DO J=2,JMM1
          DO I=2,IMM1
            Q2B_d(I,J,K)=ABS(Q2B_d(I,J,K))
            Q2LB_d(I,J,K)=ABS(Q2LB_d(I,J,K))
            BOYGR_d(I,J,K)=GEE *(RHO_d(I,J,K-1)-RHO_d(I,J,K))/(DZZ_d(K-1) *DHF_d(I,J))
            KN_d(I,J,K)=(KM_d(I,J,K) *.25 *SEF
     &        *( (U_d(I,J,K)-U_d(I,J,K-1)+U_d(I+1,J,K)-U_d(I+1,J,K-1))**2
     &          +(V_d(I,J,K)-V_d(I,J,K-1)+V_d(I,J+1,K)-V_d(I,J+1,K-1))**2 )
     &        /(DZZ_d(K-1)*DHF_d(I,J))**2)
     &        + KH_d(I,J,K) *BOYGR_d(I,J,K)
            BOYGR_d(I,J,K)=Q2B_d(I,J,K) *SQRT(Q2B_d(I,J,K))/(B1*Q2LB_d(I,J,K)+SMALL)
          ENDDO
        ENDDO
      ENDDO
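
For context, here is a minimal, self-contained sketch of the comparison I’m describing (the array A_d and the size N are hypothetical, not from my actual code): the same CUF kernel loop is launched once with <<< *,* >>>, letting the compiler/runtime pick the grid and block sizes, and once with <<< 1,1 >>>, forcing a single thread, and the two results are diffed on the host.

      PROGRAM LAUNCH_COMPARE
      USE CUDAFOR
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 64
      REAL, DEVICE :: A_d(N,N,N)
      REAL :: A_par(N,N,N), A_ser(N,N,N)
      INTEGER :: I, J, K

! Parallel run: the compiler/runtime choose the grid and block sizes
      A_d = 1.0
!$cuf kernel do(3) <<< *,* >>>
      DO K=1,N
        DO J=1,N
          DO I=1,N
            A_d(I,J,K) = A_d(I,J,K) + REAL(I+J+K)
          ENDDO
        ENDDO
      ENDDO
      A_par = A_d

! Serial reference run: a single thread executes every iteration
      A_d = 1.0
!$cuf kernel do(3) <<< 1,1 >>>
      DO K=1,N
        DO J=1,N
          DO I=1,N
            A_d(I,J,K) = A_d(I,J,K) + REAL(I+J+K)
          ENDDO
        ENDDO
      ENDDO
      A_ser = A_d

! For a thread-independent loop the two runs should agree exactly
      PRINT *, 'max parallel/serial difference:', MAXVAL(ABS(A_par-A_ser))
      END PROGRAM LAUNCH_COMPARE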

I’m not seeing anything obvious that would cause race conditions.

How different are the results? What optimization flags are you using?

Do you have a minimal reproducing example you can share?

The differences are random in time, do not accumulate over time, and are on the order of about 10% (although I have been looking at outputs averaged over 180 timesteps).

Compilation flags are -Mextend -Msave -byteswapio -fastsse -Mpreprocess -Mconcur=nonuma -Mfixed -cuda -gpu=ccnative -gpu=managed -gpu=fastmath -gpu=fma -lcudart_static

The loop in my original post is one of about 100 or so in the code. I have since discovered two more such loops.

I will aim to put together a simple example to share with you. Thanks for the offer of help!


Sounds good.

Compilation flags are -Mextend -Msave -byteswapio -fastsse -Mpreprocess -Mconcur=nonuma -Mfixed -cuda -gpu=ccnative -gpu=managed -gpu=fastmath -gpu=fma -lcudart_static

Not that any of these are causing your issue, but they look like a set from the PGI days.

“-Msave” is a “big-hammer” flag that implicitly adds the “SAVE” attribute to all local variables. This places them in static memory, so their values carry over from call to call. Most codes don’t need it.
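
As a quick illustration (ACCUM, X, and TOTAL are hypothetical names, not from your code), this is effectively what the flag does to a local variable:

      SUBROUTINE ACCUM(X, CURRENT_SUM)
      IMPLICIT NONE
      REAL, INTENT(IN)  :: X
      REAL, INTENT(OUT) :: CURRENT_SUM
! Under -Msave, TOTAL behaves as if it were declared "REAL, SAVE :: TOTAL":
! it lives in static memory and keeps its value from the previous call.
! Without -Msave (and no explicit SAVE) it is a fresh, undefined local
! on every call, so code that relies on the carry-over breaks.
      REAL :: TOTAL
      TOTAL = TOTAL + X
      CURRENT_SUM = TOTAL
      END SUBROUTINE ACCUM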

“-fastsse” is deprecated. Consider using the more up-to-date “-Ofast”.

“-Mconcur=nonuma” enables auto-parallelization on the host. It may not be needed here.
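
Putting those together, the build line might look something like this (just a sketch; I’m assuming nvfortran as the driver, and you should keep whatever else your build actually needs):

nvfortran -Mextend -byteswapio -Ofast -Mpreprocess -Mconcur=nonuma -Mfixed -cuda -gpu=ccnative -gpu=managed -gpu=fastmath -gpu=fma -lcudart_static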

Thanks for the suggestions. This is a code that has been in development and use for the past 40 years, hence the carryover from the PGI days. I also have the code set up so that the same executable can run on either the CPU or the GPU, depending on user selection, which is why I kept -Mconcur=nonuma. However, eliminating -Msave and replacing -fastsse with -Ofast did the trick: results from the parallel and serial GPU runs are now identical.

Your help is much appreciated!

Interesting. Most likely the “-Msave” flag was causing the problems. All the local variables effectively become global (static), so the compiler can’t implicitly privatize them. Granted, that shouldn’t really matter with this code snippet since the scalars are read-only, but there might be something else going on.
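
To illustrate the privatization point with a hypothetical loop temporary (TMP is not in your snippet; it’s only here to show the hazard):

! TMP is normally privatized per thread by the CUF kernel directive, so
! each thread gets its own copy. If TMP carries the SAVE attribute (e.g.,
! via -Msave) it becomes a single static location shared by all threads,
! and parallel iterations can race on it.
!$cuf kernel do(3) <<< *,* >>>
      DO K=2,KBM1
        DO J=2,JMM1
          DO I=2,IMM1
            TMP = GEE *(RHO_d(I,J,K-1)-RHO_d(I,J,K))
            BOYGR_d(I,J,K) = TMP/(DZZ_d(K-1) *DHF_d(I,J))
          ENDDO
        ENDDO
      ENDDO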

In any event, glad you got it working.