Hello, I have a CFD solver which seems to work OK in serial mode (-acc=host), but when run in parallel (-acc=multicore) the result is a random mess. The aim is to run it in GPU mode (-acc -gpu=cc75) once it works correctly in parallel.
There is just one massive (i,j,k) loop at the moment, but I'm not sure if I need to split it into two i,j,k loops, or if the data copy needs changing. It appears that the data copy is not working - at least that's my best guess. Are there any suggestions on how to fix it, or how to diagnose what the problem is, please? Thanks.
!$acc data copy(r,u,v,w,p,t,visc)
DO it = 1,nits
!$acc kernels loop independent
do k = 1,km
!$acc loop independent
  do j = 1,jm
!$acc loop independent
    do i = 1,im
      ! ... per-cell update of r,u,v,w,p,t,visc ...
    end do
  end do
end do
ENDDO
!$acc end data
The data regions shouldn't matter here given you're targeting a multicore CPU, where they are ignored. Data regions only come into play when there are separate memories, i.e. GPU and host.
I'd look for race conditions in your code. If you take "independent" off, does the compiler flag potential dependencies?
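For example, a minimal sketch of that diagnostic step (solver.f90 is just a placeholder file name): drop "independent" on one build so the compiler's own dependency analysis is in effect, then read the feedback messages.

! Diagnostic build: no "independent", let the compiler analyse the loops itself
!$acc kernels loop
do k = 1,km
!$acc loop
  do j = 1,jm
!$acc loop
    do i = 1,im
      ! ... per-cell update ...
    end do
  end do
end do

! Build with feedback enabled, e.g.:
!   nvfortran -acc=multicore -Minfo=accel solver.f90
! and check whether the compiler parallelizes these loops or instead reports
! loop-carried dependences that point at the race.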
Also, if you provide the code for the full loop, I might be able to spot potential issues.
Arrays are shared by default and scalars are firstprivate by default. There are exceptions where a scalar becomes shared (global), in which case you need to add it to a private clause, but the compiler feedback messages (i.e. add -Minfo=accel) will tell you if this is the case.
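As a minimal sketch of those defaults (the work array qloc and the scalar tmp are illustrative, not from your code): the plain local scalar needs nothing, but a small per-cell work array must be privatized explicitly or every thread will write to the same shared copy.

real :: qloc(5), tmp    ! hypothetical per-cell temporaries

!$acc kernels loop independent
do k = 1,km
!$acc loop independent
  do j = 1,jm
!$acc loop independent private(qloc)
    do i = 1,im
      tmp     = p(i,j,k)/r(i,j,k)        ! local scalar: firstprivate by default
      qloc(1) = r(i,j,k)*u(i,j,k)        ! work array: shared by default, so it
      qloc(2) = r(i,j,k)*v(i,j,k)        ! needs the explicit private clause
      ! ... rest of the per-cell update ...
      t(i,j,k) = tmp + qloc(1) + qloc(2) ! placeholder write-back
    end do
  end do
end do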
The results are now correct, and it runs well with "parallel loop collapse(3)", but it is very slow (slower than the serial CPU run) with "kernels loop independent" plus a private clause and a loop directive on each inner loop.
"parallel loop" and "kernels loop independent" are roughly equivalent, but they use different planners and so can generate different schedules. Here, though, it's more likely the difference between using collapse(3) and explicitly setting a loop directive on each loop. Which is better will depend on which schedule (gang, worker, vector) is being applied, the loop trip counts, and which loop index corresponds to the stride-1 dimension of the arrays.
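For reference, the two variants being compared would look roughly like this (a sketch; qloc stands in for whatever is in your private list):

! Variant 1: a single collapsed parallel loop, one schedule over all i,j,k
!$acc parallel loop collapse(3) private(qloc)
do k = 1,km
  do j = 1,jm
    do i = 1,im
      ! ... per-cell update ...
    end do
  end do
end do

! Variant 2: kernels with an explicit loop directive on each level; the
! compiler picks a gang/worker/vector schedule per loop
!$acc kernels loop independent private(qloc)
do k = 1,km
!$acc loop independent
  do j = 1,jm
!$acc loop independent
    do i = 1,im
      ! ... per-cell update ...
    end do
  end do
end do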
My best guess here is that since private only applies to the loop where it's placed, the inner loops may not be getting parallelized, because doing so would cause a race condition. You'd need to look at the -Minfo messages to confirm.
If you collapse the outer kernels directive, you'll likely get similar times to the parallel case. You can also try moving the private clause to the innermost loop to see if that helps.
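A sketch of those two suggestions (again, qloc is just a stand-in for your private list):

! Fix (a): collapse the kernels loop, mirroring the parallel-loop schedule
!$acc kernels loop independent collapse(3) private(qloc)
do k = 1,km
  do j = 1,jm
    do i = 1,im
      ! ... per-cell update ...
    end do
  end do
end do

! Fix (b): keep the per-loop directives, but privatize on the loop that
! actually uses the temporary
!$acc kernels loop independent
do k = 1,km
!$acc loop independent
  do j = 1,jm
!$acc loop independent private(qloc)
    do i = 1,im
      ! ... per-cell update ...
    end do
  end do
end do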