Nested parallelism using the ACML

Hi

I am writing code for a quad-socket Opteron 6100 system, where I want to exploit nested parallelism in order to use the full bandwidth of the ccNUMA architecture. The idea I want to implement is shown below, but I cannot get nested parallelism in the calls to ACML. If I remove the outer parallel region from the example, the ACML call itself uses multiple CPUs, but as soon as I add the outer parallel region, the call runs single-threaded. What do I have to do to get parallelism at both levels at the same time?

I am using PVF 13.2 with VS2010, executing on Windows 2008 R2.

Best regards,

Casper

program prog
  use omp_lib
  implicit none
  ! N and M are named constants because A, B and iPiv are dimensioned with N
  integer, parameter :: N=1000, M=200
  integer :: i,j,NRHS,LDB
  complex*16 :: A(N,N),B(N,1)
  integer :: iPiv(N)
  integer :: info
  ! The OpenMP Fortran API takes logical arguments for these two calls
  call omp_set_nested(.true.)
  call omp_set_dynamic(.true.)
  !$OMP PARALLEL DEFAULT(PRIVATE) NUM_THREADS(8)
  !$OMP DO
  ! We want to parallelize a program that does a boatload (M) of successive calls to ZGETRF
  do j=1,M
    ! Fill a dummy matrix for the example
    A(:,:)=0d0
    do i=1,N
      A(i,i)=1d0
    end do
    ! I want to use 6 threads for each of the ACML calls, i.e. 8x6 threads in total,
    ! but I cannot get any parallelism out of the following ACML call with the outer parallel region enabled.
    call omp_set_num_threads(6)
    call ZGETRF( N, N, A, N, iPiv, info )
  end do
  !$OMP END DO
  !$OMP END PARALLEL

end program prog

No suggestions? :-(

Hi Casper,

We’re not sure whether AMD’s ACML supports nested parallelism. That said, in addition to setting OMP_NESTED, you may also need to set the environment variable “OMP_MAX_ACTIVE_LEVELS=2”. Give that a try.
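
One way to separate the two questions (does the OpenMP runtime nest at all, and does ACML make use of it) is a small test that does not call ACML. The sketch below is only an illustration: the program name and the team sizes (2 outer, 3 inner) are arbitrary, and it assumes the compiler supports the OpenMP 3.0 routines omp_set_max_active_levels and omp_get_ancestor_thread_num.

```fortran
program check_nested
  use omp_lib
  implicit none

  ! Request nested parallelism and allow two active levels.
  ! This mirrors setting OMP_NESTED and OMP_MAX_ACTIVE_LEVELS=2 in the environment.
  call omp_set_nested(.true.)
  call omp_set_max_active_levels(2)
  ! Assumption: dynamic adjustment is turned off so the runtime is less
  ! likely to shrink the requested teams.
  call omp_set_dynamic(.false.)

  print *, 'nested enabled:    ', omp_get_nested()
  print *, 'max active levels: ', omp_get_max_active_levels()

  !$omp parallel num_threads(2)
  ! Team size for parallel regions started by this outer thread
  call omp_set_num_threads(3)
  !$omp parallel
  !$omp critical
  print '(a,i0,a,i0,a,i0)', 'outer thread ', omp_get_ancestor_thread_num(1), &
        ', inner thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
  !$omp end critical
  !$omp end parallel
  !$omp end parallel
end program check_nested
```

If every line reports an inner team "of 1", the inner regions are being serialized by the runtime itself. If the inner teams do have more than one thread but ZGETRF still runs single-threaded, the limitation is more likely inside the library, since a threaded library may choose to run serially when it is called from within an active parallel region.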

  • Mat