I have tried searching this forum for answers with no success.
I am using POSIX threads to drive OpenMP, with the intention of using a single POSIX thread to drive CPU code with loop-based OpenMP parallelism, and a separate POSIX thread to drive GPU code via OpenMP. POSIX thread synchronization is done with pipes and select().
At present I am making both POSIX threads drive CPU OpenMP parallelism before proceeding with target offloading. With GCC this works. With the nvc compiler, I get indications that segmentation faults take place. Valgrind does not trap any issues. Code was tested on an IBM Power9 system, on a Linux Epyc, and on a Linux Intel system. Results are consistently indicating that nvc acts problematically, while GCC works as expected.
I should mention that, because of how OpenMP’s API sets variables such as the number of threads, I do not do anything too fancy. In fact, I tried forcing the use of a single CPU thread in OpenMP for one of the POSIX threads, and this did not help. I understand how there can be a problem with two separate OpenMP CPU parallel regions executing in a single memory space. But I have no means of gathering more diagnostics. Would profiling help? Valgrind and gdb alter the runtime execution enough to make the problem disappear. However, I do see what looks like a race condition, as my numerical results contain NaNs with nvc.
Apologies for the lengthy and convoluted question; if it were simple, I would not be asking here!
Can you provide a minimal reproducing example of what you’re doing?
Personally I’ve never tried mixing pthreads with OpenMP. It seems problematic given that on POSIX-based systems most OpenMP implementations use pthreads already. There may be unsafe situations, such as killing a pthread before all the OpenMP threads in that region have exited. Given that we make our OpenMP threads persistent, this could indeed be the issue.
Though if you can get me an example, I’ll look into it and hopefully get you a better answer.
Thanks for looking at this. I would share a concise version of the code, but it is not worth it for two reasons: the first is that it would take some time to prepare; the second is that I have likely found the explanation, as I will attempt to describe.
When using the nvidia compilers, the code fails, and it likely should. Since the OpenMP pragmas are opaque about what they do, and I lack knowledge of exactly how the compiler structures the threading, I am uncertain whether it creates per-parallel-region constructs or uses some global construct. The latter appears to be the case.
The OpenMP specification gives no guarantee that several CPU OpenMP parallel regions, launched from different native threads, execute without being connected in some manner; the clearly supported pattern is to specify all the parallelization within a single block of code.
I decided to test this hypothesis. I had the POSIX thread that is supposed to do OpenMP threading proceed as expected, and I had the second POSIX thread proceed without it. I now intend to have the second thread drive GPU offloading via OpenMP target constructs, in the hope that these use a separate internal construct and thus will not share any state with the CPU OpenMP parallel regions. I will report back on what I find.