I am writing code for a quad-socket Opteron 6100 system, where I want to exploit nested parallelism in order to use the full bandwidth of the ccNUMA architecture. The idea I want to implement is shown below, but I cannot get nested parallelism to work in calls to ACML. If I remove the outer parallel region from the example, the ACML call itself uses multiple CPUs, but as soon as I add the outer parallel region, the call runs single-threaded. What do I have to do to get parallelism from both levels at the same time?
I am using PVF 13.2 with VS2010, executing on Windows 2008 R2.
program prog
implicit none
integer, parameter :: N=1000, M=200
integer :: i,j,NRHS,LDB
Complex*16 :: A(N,N),B(N,1)
integer :: iPiv(N)
integer :: info
!We want to parallelize a program that does a boatload (M) of successive calls to ZGETRF
!$OMP PARALLEL DEFAULT(PRIVATE) NUM_THREADS(8)
!$OMP DO
do i=1,M
   !Fill a dummy matrix for the example (placeholder values, diagonally dominant)
   do j=1,N
      A(:,j) = (1.0d0,0.0d0)
      A(j,j) = cmplx(N,0,kind(0.0d0))
   end do
   !I want to use 6 threads for each of the ACML calls, i.e. 8x6 threads in total,
   !but I cannot get any parallelism out of the following ACML call with the outer parallel region enabled.
   CALL ZGETRF( N, N, A, N, IPIV, INFO )
end do
!$OMP END DO
!$OMP END PARALLEL
end program prog
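
For reference, the 8x6 layout I am after can be sketched with plain nested OpenMP regions, leaving ACML out entirely. This is only a minimal sketch: the program name `nested_sketch` and its variables are made up for illustration, and it assumes the OpenMP runtime supports nested parallelism at all, which I have not confirmed for PVF. Whether ACML's internal threading respects these settings is exactly the part I am unsure about.

```fortran
program nested_sketch
  use omp_lib
  implicit none
  ! Hypothetical names; 8 outer threads x 6 inner threads as described above
  integer, parameter :: n_outer = 8, n_inner = 6
  integer :: outer_id, inner_count

  call omp_set_nested(.true.)    ! request nested parallel regions
  call omp_set_dynamic(.false.)  ! ask the runtime not to shrink the teams

  !$OMP PARALLEL NUM_THREADS(n_outer) PRIVATE(outer_id, inner_count)
  outer_id = omp_get_thread_num()

  ! Inner region: stands in for the threaded library call
  !$OMP PARALLEL NUM_THREADS(n_inner)
  !$OMP SINGLE
  inner_count = omp_get_num_threads()  ! size of the inner team actually granted
  !$OMP END SINGLE
  !$OMP END PARALLEL

  !$OMP CRITICAL
  print '(a,i0,a,i0,a)', 'outer thread ', outer_id, &
        ' ran an inner team of ', inner_count, ' thread(s)'
  !$OMP END CRITICAL
  !$OMP END PARALLEL
end program nested_sketch
```

If every outer thread reports an inner team of 1 thread, the runtime is serializing the nested regions, which would match the single-threaded behavior I see in the ACML call.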