Nested OpenMP not supported in community edition?

Hello.

I have a PGI Visual Fortran license for development on a PC, and I am using the Community Edition on a Linux machine.

Nested OpenMP works fine in PVF, but only one thread is generated for a nested parallel region in the Community Edition.

Isn’t nested parallelism supported in PGI community edition?

Yes, Linux supports nested parallelism, and the community license
enables all the features on the Linux systems.

  1. Make sure you set OMP_NESTED,
    setenv OMP_NESTED TRUE

  2. Make sure you set the active levels.
    setenv OMP_MAX_ACTIVE_LEVELS 3

  3. At each level, a parallel region will start as many threads
    as are set by OMP_NUM_THREADS.

So if OMP_NUM_THREADS is set to 8, and OMP_MAX_ACTIVE_LEVELS is
set to 3, you will have problems, because
8**3 is greater than the thread limit of 256.
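The arithmetic behind that limit can be sketched in a few lines of C++. This is only an illustration: the helper names nested_thread_count and fits_thread_limit are made up for this post, and the default of 256 is simply the thread limit quoted above, not a value read from the runtime.

```cpp
#include <cassert>

// Hypothetical helper: worst-case number of threads alive at the innermost
// level when every nesting level opens a team of threads_per_level threads.
long nested_thread_count(int threads_per_level, int levels) {
    assert(threads_per_level > 0 && levels >= 0);
    long total = 1;
    for (int l = 0; l < levels; l++) {
        total *= threads_per_level;  // each thread opens a full inner team
    }
    return total;
}

// True when the worst case stays within the thread limit
// (256 is the limit quoted in this thread, assumed here as the default).
bool fits_thread_limit(int threads_per_level, int levels, long limit = 256) {
    return nested_thread_count(threads_per_level, levels) <= limit;
}
```

With 8 threads per level and 3 levels the count is 512, which is over the limit; with 4 threads per level it is 64, which fits.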

Here is a program with 3 levels of nesting.

Try OMP_NUM_THREADS set to 8, and then
OMP_MAX_ACTIVE_LEVELS to 3, and you will see problems.

pgfortran -Mfixed -mp -o hello_mp hello_mp.f
./hello_mp

% more hello_mp.f
      program hello_mp
      integer OMP_GET_THREAD_NUM, i, j
! j must be private, otherwise all threads write the same variable
!$omp parallel private(j)
!$omp do
      do i=1,8
         j=OMP_GET_THREAD_NUM()
         call hello_mp_sub(i,j)
      end do
!$omp end parallel
      end

      subroutine hello_mp_sub(loopnum,threadnum)
      integer OMP_GET_THREAD_NUM, loopnum,threadnum,i,j
!$omp parallel private(j)
!$omp do
      do i=1,4
         j=OMP_GET_THREAD_NUM()
         call hello_mp_sub_sub(loopnum,threadnum,i,j)
      end do
!$omp end parallel
      end

      subroutine hello_mp_sub_sub(lnum,tnum,l2num,t2num)
      integer OMP_GET_THREAD_NUM, lnum,tnum,l2num,t2num,i,j
!$omp parallel private(j)
!$omp do
      do i=1,2
         j=OMP_GET_THREAD_NUM()
         write(6,100)lnum,tnum,l2num,t2num,i,j
  100    format(3x,"main lnum=",i3,2x,"main tnum=",i3,2x,
     1   "2nd lnum=",i3,2x,"2nd tnum=",i3,2x,
     1   "3rd lnum=",i3,2x,"3rd tnum=",i3,2x )
      end do
!$omp end parallel
      end

Isn’t nested parallelism supported in PGI community edition?

The Community Edition has the exact same features as the Professional Editions. The main difference is the support level and the number of available releases per year.

To enable OpenMP Nested Parallelism, you need to set the environment variable “OMP_NESTED” to true (the default is false). You may also want to use the environment variable “OMP_MAX_ACTIVE_LEVELS” to increase the maximum number of nested regions since the default is 1.
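The same settings can also be requested from inside the program. Below is a minimal sketch, assuming the standard OpenMP routines omp_set_nested and omp_set_max_active_levels; the helper name inner_team_size is made up for illustration, and the serial fallbacks are only there so the code still builds when no OpenMP flag is given. Whether a given PGI release honors these calls the same way as the environment variables is not confirmed here.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption: used only when OpenMP is not enabled).
static inline int  omp_get_num_threads() { return 1; }
static inline void omp_set_nested(int) {}
static inline void omp_set_max_active_levels(int) {}
#endif

// Hypothetical helper: opens an outer team and, inside it, an inner team,
// and returns the largest inner team size that was actually observed.
int inner_team_size(int outer, int inner) {
    assert(outer > 0 && inner > 0);
    omp_set_nested(1);             // same effect as OMP_NESTED=TRUE
    omp_set_max_active_levels(2);  // same effect as OMP_MAX_ACTIVE_LEVELS=2
    int largest = 1;
    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            int team = omp_get_num_threads();
            #pragma omp critical
            {
                if (team > largest) largest = team;
            }
        }
    }
    return largest;
}
```

If the call returns 1 even though more inner threads were requested, the nested region was serialized, which is the symptom described in the original question.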

Hope this helps,
Mat

But I get the following message after execution:

OMP THREADS RESTRICTED
The PGI compiler license used to create this executable is for development
purposes only. The number of OpenMP threads is limited to 4 and nested
parallelism has been disabled. Please contact PGI Sales at sales@pgroup.com
for information on lifting these restrictions.

License key that I’m using:

PGI Community Edition v16.10 license key file

Valid through: November 30 2017

NOTE: modifying this file may cause your license to fail.

PACKAGE PGI2016-COMMUNITY pgroupd 2016.1020 COMPONENTS=“pgfortran
pgcc pgc++ pvf pgprof pgdbg” OPTIONS=SUITE SIGN=“098A 7730
A3A2 4C4E 5E9D C09B D5B8 F718 1488 FBEE 0114 A9DE CF37 3A7B
4228 07ED 9BD5 6260 3429 1217 EB0D 9620 1B43 2160 7382 97A3
3097 CF57 EF7B 8F37”
FEATURE PGI2016-COMMUNITY pgroupd 2016.1020 30-nov-2017 uncounted
VENDOR_STRING=COMMUNITY:16:ws:accel:v16.10 HOSTID=ANY
PLATFORMS=“lin lin-acc win win-acc osx osx-acc”
SUITE_DUP_GROUP=UH TS_OK SIGN=“0C14 D5F3 ED81 829F CA03 E69B
2E7F 4C39 DBC0 B6C1 EA43 F10C 39E9 AA90 1209 08B5 C477 FFA3
356D D0C7 C800 6C52 E11C 3169 C18C 01D7 83D6 E7B8 1C23 614E”

This file is located in /opt/pgi.

I’m using 16.5 since 16.10 cannot compile my code.

The community edition license only applies to the 16.10 release.

The limits in older releases are unfortunate, but they are not caused by the license.

What is not working with 16.10 that works with 16.5?

dave

Hello all,

I have the same problem regarding enabling NESTED. The output of my code with PGI compiler is different than GCC.

I am compiling with PGI Community Edition 18.10 and G++ 7.3 on Ubuntu 18.04.

g++ runs my code in a nested manner while PGI does not!


Compiling:
pgc++ -mp f1.cpp -o exe.pgi
g++ -fopenmp f1.cpp -o exe.g++


My code:

#include <stdio.h>
#include <omp.h>
#include <math.h>

double my_sleep(long n) {
	double sum = 0;
	for(int i=1;i<=n;i++) {
		sum += sqrt(1.0*i*i) + 1.0*i*i;
	}
	return sum;
}

int main() {

	omp_set_nested(1);

	#pragma omp parallel num_threads(3)
	{
		printf("====lvl 1 - thread count: %d\n", omp_get_num_threads());
		int id =  omp_get_thread_num();

		long sum = 0;
		int max_id = -1;
		#pragma omp parallel for reduction(+:sum) reduction(max:max_id) num_threads(6) 
		for(int i=0;i<100;i++) {

			max_id = omp_get_thread_num();
			
			if(omp_get_num_threads() > 1) {
				printf("====lvl 2 - thread count: %d\n", omp_get_num_threads());
			}
			sum += i;
			if(id == 0)
				my_sleep(10000000);
			else if(id == 1)
				my_sleep(100000);
			else if(id == 2)
				my_sleep(1000);
		}
		printf("in thread %2d - sum: %ld - max_id: %d\n", id, sum, max_id);
	}

	return 0;
}

Hi Millad,

Try using the PGI LLVM compiler instead.

pgc++ -Mllvm -mp f1.cpp -o exe.pgi

Since we’re moving towards using LLVM by default, most new development efforts with OpenMP are being added there.

-Mat

% pgc++ -mp nested.cpp -Mllvm
"nested.cpp", line 25: warning: variable "max_id" was set but never used
        for(int i=0;i<100;i++) {
        ^

% a.out
====lvl 1 - thread count: 3
====lvl 1 - thread count: 3
====lvl 1 - thread count: 3
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
.... continues ...
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
====lvl 2 - thread count: 6
in thread  0 - sum: 4950 - max_id: 5

Hi Mat,

That is great. Thank you so much!

It works right now.

Currently, I have set the main path of my compiler to “/opt/pgi/linux86-64/18.10”.
Is it safe to set it to linux86-64-llvm forever?
I am asking this because I have to use “LD_PRELOAD” to be able to run my code:

LD_PRELOAD=/opt/pgi/linux86-64-llvm/18.10/lib/libpgkomp.so ./exe.pgi

Without preload, it gives me the following error:

./exe.pgi: error while loading shared libraries: /opt/pgi/linux86-64/18.10/lib/libpgkomp.so: file too short

I am setting my bashrc file as below for PGI:

# PGI

PGI_DIR=/opt/pgi/linux86-64/18.10
export PATH=$PGI_DIR/bin/:$PATH
export LIBRARY_PATH=$PGI_DIR/lib/:$LIBRARY_PATH
export LD_LIBRARY_PATH=$PGI_DIR/lib/:$LD_LIBRARY_PATH

How can I make the -Mllvm flag always enabled?


On the website (https://www.pgroup.com/products/), it is said that PGI supports OpenMP 4.5. Is this correct? It gives me an error when I call team-related APIs in my code.


Thank you,
Millad

Is it safe to set it to linux86-64-llvm forever?

We’re moving towards making the LLVM compilers the default so the directory structure will change a bit in the future. Though, you’re good for 18.10.

Without preload, it gives me following error:

Yes, the LLVM compilers use the KMPC OpenMP runtime (i.e. libpgkomp.so) as opposed to our older PGI native runtime (libpgmp.so). Though the libpgkomp.so library under the native directory is a dummy library, so please change your LD_LIBRARY_PATH directory to the “linux86-64-llvm” path.

Use:

# PGI 
PGI_DIR=/opt/pgi/linux86-64-llvm/18.10 
export PATH=$PGI_DIR/bin/:$PATH 
export LIBRARY_PATH=$PGI_DIR/lib/:$LIBRARY_PATH 
export LD_LIBRARY_PATH=$PGI_DIR/lib/:$LD_LIBRARY_PATH

This will set your path to use the LLVM compilers by default, without the need to set “-Mllvm”.

On the website (https://www.pgroup.com/products/), it is said that PGI supports OpenMP 4.5. Is this correct?

Yes, but there are limitations. For example, we don’t yet support offload and there’s one “team”.

https://www.pgroup.com/resources/docs/18.10/x86/pgi-release-notes/index.htm#openmp-chgs

-Mat

Thank you so much for your responses, Mat.

Thanks for confirming the number of teams (“one team”) at the end. That was one of my issues when using teams in my code.


Bests,
Millad

So, I am having another issue.

Suppose I have an OpenMP region. Within the OpenMP region, I have an OpenACC region that uses deviceptr (I have already allocated my memory on the device using cudaMalloc). Now, this setup causes the compiler to crash!

Here is an example code:

#include <stdio.h> 
#include <omp.h> 
#include <math.h> 

#include <cuda_runtime.h>

double my_sleep(long n) { 
   double sum = 0; 
   for(int i=1;i<=n;i++) { 
      sum += sqrt(1.0*i*i) + 1.0*i*i; 
   } 
   return sum; 
} 

int main() { 

   omp_set_nested(1); 
   double *d2;
   int N = 10000;
   double count = 0;
   #pragma omp parallel num_threads(4)
   {
      cudaMalloc((void**) &d2, sizeof(double)*N);

      #pragma acc parallel loop deviceptr(d2) 
      for(int i=0;i<N;i++) {
         d2[i] *= 2;
      }
      cudaFree(d2);
   }

   printf("count: %.2f\n", count);

   return 0; 
}

The compiler output:

main:
     23, Accelerator kernel generated
         Generating Tesla code
         26, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
PGCC/x86-64 Linux 18.10-0: compilation completed with warnings
/opt/pgi/linux86-64-llvm/18.10/share/llvm/bin/opt: /tmp/pgc++uFsc8eG9MV-h.ll:247:31: error: use of undefined value '%d2.addr'
        %32 = load double*, double** %d2.addr, align 8, !tbaa !61, !dbg !71
                                     ^

And, following is my compile command:

 pgc++ -m64 -std=c++11 -w -Mllvm -Mcuda -mp -acc -ta=tesla,multicore -Minfo=accel -O3  correct.cpp -o exe.pgi

If the OpenMP pragma is removed, the compilation is successful.


How can I fix this? I really need this capability: having OpenACC regions within OpenMP.


P.S.: the line numbers in the compiler’s output are always one line behind. For instance, when the OpenACC pragma is on line 36 and the loop is on line 37, the compiler always reports that it parallelized line 35!

Hi Millad,

“d2” needs to be private for each OpenMP thread. Otherwise, they’re sharing the same pointer.

#pragma omp parallel num_threads(4) private(d2)

Hope this helps,
Mat

Yes!!! It worked!

Thanks again, Mat.

But why should deviceptr be sensitive to using a shared variable instead of a private one?
All deviceptr does is tell the compiler that the address provided is available on the device. So, no need for it to be private, right? Just thinking out loud!


Millad

It doesn’t really have anything to do with the fact that this is a device pointer or even that OpenACC is being used. The same issue would occur if it was a host pointer being used in just the OpenMP parallel region.

Since the malloc and free are in a parallel region, every OMP thread will perform the malloc and free. So if “d2” is shared, the same pointer will be used by all of them, with the actual value of “d2” dependent on whichever thread did the allocation last. Plus you’d have a memory leak, since the memory allocated by the other threads becomes inaccessible.

Worse, when the code tries to free the memory, the same address will be freed multiple times, which can cause run time failures.

And for added fun, since all the threads are using the same address, you’d have a race condition.
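Since the same reasoning applies to a host pointer, the safe pattern can be sketched without any device code. This is a minimal host-only illustration, not the code from this thread: the helper name private_buffer_demo is made up, and plain malloc/free stand in for cudaMalloc/cudaFree.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: every thread allocates, fills, doubles and frees its
// own buffer. Returns how many threads saw the expected contents.
int private_buffer_demo(int n) {
    assert(n > 0);
    int ok = 0;
    #pragma omp parallel num_threads(4) reduction(+:ok)
    {
        // Declared inside the region, so each thread has its own pointer.
        // A pointer declared outside and left shared would be overwritten
        // by whichever thread allocated last.
        double *buf = static_cast<double *>(std::malloc(sizeof(double) * n));
        if (buf != nullptr) {
            for (int i = 0; i < n; i++) buf[i] = 1.0;
            for (int i = 0; i < n; i++) buf[i] *= 2;
            bool good = true;
            for (int i = 0; i < n; i++) {
                if (buf[i] != 2.0) good = false;
            }
            if (good) ok += 1;
            std::free(buf);  // each thread frees only what it allocated
        }
    }
    return ok;
}
```

Declaring the pointer inside the region has the same effect as the private(d2) clause: each thread frees exactly the block it allocated, so there is no leak and no double free.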

If you really did want to have all the OpenMP threads share the same allocated device memory, then move the malloc and free outside of the parallel region. Something like:

   cudaMalloc((void**) &d2, sizeof(double)*N); 
   #pragma omp parallel num_threads(4) 
   { 
      #pragma acc parallel loop deviceptr(d2) 
      for(int i=0;i<N;i++) { 
         d2[i] *= 2; 
      } 
   } 
    cudaFree(d2);

Though the race condition would persist since the 4 device kernels on the device are all accessing and updating the same memory.

You can fix this by partitioning the device memory so each thread only accesses a section of the device memory.

Something like:

   cudaMalloc((void**) &d2, sizeof(double)*N); 
   #pragma omp parallel num_threads(4) 
   { 
      int thid=omp_get_thread_num();
      int block_size = (N+3)/4;
      int start = thid*(block_size);
      int end = start+block_size;
      if (end > N) end=N;
      #pragma acc parallel loop deviceptr(d2) 
      for(int i=start;i<end;i++) { 
         d2[i] *= 2; 
      } 
   } 
    cudaFree(d2);
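The index arithmetic in that partitioning can be checked on the host before it goes anywhere near a device. The sketch below is an illustration only: block_range and ranges_tile are hypothetical helper names, and the thread count is a parameter rather than the hard-coded 4 used above.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: half-open range [start, end) owned by thread thid
// when n elements are split into nthreads contiguous blocks.
std::pair<int, int> block_range(int thid, int nthreads, int n) {
    assert(nthreads > 0 && thid >= 0 && thid < nthreads && n >= 0);
    int block_size = (n + nthreads - 1) / nthreads;  // ceiling division
    int start = thid * block_size;
    int end = start + block_size;
    if (start > n) start = n;  // trailing threads may get an empty range
    if (end > n) end = n;
    return std::make_pair(start, end);
}

// Checks that the ranges tile [0, n) exactly, with no gap and no overlap.
bool ranges_tile(int nthreads, int n) {
    int expected_start = 0;
    for (int t = 0; t < nthreads; t++) {
        std::pair<int, int> r = block_range(t, nthreads, n);
        if (r.first != expected_start || r.second < r.first) return false;
        expected_start = r.second;
    }
    return expected_start == n;
}
```

For N = 10000 and 4 threads each thread gets 2500 elements; for N = 10 the last thread gets only element 9. Because no two ranges overlap, no two threads update the same element.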

-Mat

Thank you Mat for your discussion.

Yes. Honestly, I meant the second scenario where the allocation happens outside of the OpenMP region. I had that in mind when I sent the other post.

I totally agree with your first argument on declaring inside the openmp region (since each thread will allocate its own memory block).

But I disagree with the second argument, about declaring outside the OpenMP region. When we declare and allocate a pointer outside the region and use that pointer as shared within the region, then as long as we only access it through array indexes (like d2[i]), we should be OK since d2 itself is not modified (am I correct?). Still, using your last code helps to avoid redoing the computations.

I think your last example (distributing work among threads) has a bug (at least on my system). The d2 variable should be decorated with “firstprivate” on the openmp region. Declaring as “shared” causes the compiler to crash again (with the same error).

Mat, this was a great discussion. Thank you.

One last comment I have is this: shouldn’t all of these issues manifest at run time? Why is the compiler crashing? I think erroneous output from my program should have been what led me to find the issue.

then we should be OK since d2 is not modified (am I correct?).

If d2 wasn’t modified, then, yes, it would be fine. But in the code you posted, it is modified, with all threads iterating across the same set of values for “i”. This is a classic race condition.

      for(int i=0;i<N;i++) { 
         d2[i] *= 2; 
      }



I think your last example (distributing work among threads) has a bug (at least on my system).

Yes, that’s what I get for not actually testing code before posting. This does look like a code generation bug in our LLVM compiler, so I’ve reported it as TPR#26762. We have a few other similar issues which may be related.

Note, the error doesn’t occur for me when compiling with our non-LLVM compilers so you might try using the compiler under the “$PGI/linux86-64/18.10” directory instead.

One last comment I have is this: shouldn’t all of these issues manifest at run time? Why is the compiler crashing?

Yes, it would. I was just avoiding the code generator bug thinking that it wouldn’t occur after fixing your code. My bad.

-Mat

Thank you, Mat.