I have a large code that I am parallelizing with MPI + OpenMP + CUDA Fortran. I am currently using pgf90 v12.10, though other versions from 10.x onward have also been used.
Different parts of the code use nested OpenMP, and while this works fine with other compilers (ifort, gfortran, xlf), pgf90 seems to refuse any nested OpenMP.
For example, here is a snippet illustrating the use of nested loop parallelization:
REAL FUNCTION wallclock()
  integer, save :: count(2), count_rate = 0
  real, save :: norm, offset = 0.
  if (count_rate == 0) then
     call system_clock(count=count(1), count_rate=count_rate)
     norm = 1./real(count_rate)
  end if
  call system_clock(count=count(2))
  wallclock = (count(2)-count(1))*norm + offset
  if (wallclock < 0.) then
     offset = offset + 24.*3600.       ! clock wrapped around; carry a day forward
     wallclock = wallclock + 24.*3600.
  end if
END FUNCTION wallclock
Program Test_Nested_OpenMP
  implicit none
  integer, parameter :: n = 80000000
  integer :: i, j
  real, dimension(:,:), allocatable :: a, b   ! real, so the sin() results are not truncated
  real :: t0, t1, t2
  real, external :: wallclock
  allocate(a(n,2), b(n,2))
  a = 0.; b = 0.
  t0 = wallclock()
  !$omp parallel do collapse(2)
  do j = 1, 2
     do i = 1, n
        a(i,j) = sin(real(i+j))
     enddo
  enddo
  t1 = wallclock()
  print *, 'Number of elements :', n
  print *, 'Time to initialize array :', t1-t0
  print *, '----------------------------------------------------'
  !$omp parallel do num_threads(2) shared(a,b) private(i,j)
  do j = 1, 2
     !$omp parallel shared(a,b,j) private(i)
     !$omp do
     do i = 1, n
        b(i,j) = sin(real(i+j))
     enddo
     !$omp enddo nowait
     !$omp end parallel
  enddo
  !$omp end parallel do
  t2 = wallclock()
  print *, 'Time to do nested region :', t2-t1
END Program Test_Nested_OpenMP
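As a stand-alone check of whether the runtime honours nesting at all, something along these lines (a sketch, assuming an OpenMP 3.0 `omp_lib` module, which pgf90 12.x should provide) prints the active level and team size from inside two nested regions — with working nesting one would expect active level 2 and a team size of 2 in the inner region, otherwise a team size of 1:

```fortran
! Minimal nesting diagnostic (assumes OpenMP 3.0 routines from omp_lib).
program check_nesting
  use omp_lib
  implicit none
  call omp_set_nested(.true.)          ! equivalent to OMP_NESTED=true
  call omp_set_max_active_levels(2)    ! equivalent to OMP_MAX_ACTIVE_LEVELS=2
  !$omp parallel num_threads(2)
  !$omp parallel num_threads(2)
  !$omp single
  ! one print per inner team: two lines if nesting works, one if it does not
  print *, 'level:', omp_get_level(), &
           ' active level:', omp_get_active_level(), &
           ' team size:', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
  !$omp end parallel
end program check_nesting
```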
Compiling and executing with:
$ pgf90 -O2 -mp -Minfo test_nested_openmp.f90
test_nested_openmp:
26, Memory zero idiom, array assignment replaced by call to pgf90_mzero4
28, Parallel region activated
30, Parallel loop activated with static block schedule
33, Parallel region terminated
40, Parallel region activated
41, Parallel loop activated with static block schedule
43, Parallel region activated
47, Parallel region terminated
51, Parallel region terminated
$ env OMP_NUM_THREADS=4 OMP_MAX_ACTIVE_LEVELS=2 OMP_NESTED=true OMP_DYNAMIC=true OMP_THREAD_LIMIT=4 taskset -c 0-3 ./a.out
I get
Number of elements : 80000000
Time to initialize array : 2.112338
Time to do nested region : 3.674445
With other compilers (xlf, ifort, gfortran) the two times are equal. I have tried almost every variation of the OMP environment variables, to no avail.
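For reference, the same settings can also be requested programmatically via `omp_lib` before the first parallel region, bypassing the environment variables entirely (a sketch; `omp_set_max_active_levels` requires an OpenMP 3.0 runtime):

```fortran
! Request nested parallelism from inside the program (OpenMP 3.0 API).
program request_nesting
  use omp_lib
  implicit none
  call omp_set_dynamic(.false.)        ! do not let the runtime shrink teams
  call omp_set_nested(.true.)          ! equivalent to OMP_NESTED=true
  call omp_set_max_active_levels(2)    ! equivalent to OMP_MAX_ACTIVE_LEVELS=2
  print *, 'nested enabled:', omp_get_nested(), &
           ' max active levels:', omp_get_max_active_levels()
end program request_nesting
```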
Is nested OpenMP not - or only partially - supported by the PGI compilers?
best,
Troels