I’ve discussed your issue with several of our other application engineers, and we aren’t really sure what the problem is. If your parallel computation is very small but there are an extremely large number of tasks, your time may be dominated by the overhead of creating and managing those tasks.
One thing to try is to record your times at 1, 2, 4, and the maximum number of threads. If the program doesn’t scale (i.e., the times are roughly the same), then this may indeed be the problem. The fix would then be to give each TASK more work.
If this isn’t the problem, we’d need to see a reproducing example to tell what’s wrong.
Thanks for answering. The code I am testing has sufficient computation, so I think this is a data-scoping issue: I expected that variables shared in the OMP PARALLEL region would propagate to the nested TASK constructs as well. Updated code structure:
!$OMP PARALLEL SHARED(...) PRIVATE(...) FIRSTPRIVATE(...)
!$OMP SINGLE
!$OMP TASK UNTIED
CALL test(...)
!$OMP END TASK
!$OMP END SINGLE
!$OMP END PARALLEL
In the test function:
SUBROUTINE test(...)
...
DO i=i0,in
  DO j=j0,jn
    DO k=k0,kn
!$OMP TASK SHARED(...) FIRSTPRIVATE(...) ! variables are either shared or firstprivate here
      <embarrassingly parallel code>
!$OMP END TASK
    ENDDO
  ENDDO
ENDDO
!$OMP TASKWAIT
END SUBROUTINE test
The code runs indefinitely. For this type of code structure, do you have any advice on avoiding scoping-related pitfalls? Thank you.