Performance issues with llvm and openmp

If this is a known issue, I apologize, as I don’t see any references to it. I’m using Community Edition 19.10.

I’m compiling/running an older model where the main loop splits the dataset over threads using OpenMP directives. Each time step branches over the spatial data, then the data is recombined after the parallel section and execution returns to a single thread before proceeding to the next time step.

With the newer llvm-based compiler, I get significantly degraded performance compared to the non-llvm compiler. While I haven’t profiled it yet, just observing “top” I can see the process repeatedly go back and forth between 200% (2 threads) and 100% (single thread). Using the non-llvm compiler, it stays pegged at 200% throughout. The llvm build is somewhere around 40%-50% slower. Is this a known difference between the two? It almost seems as if there is more thread overhead when launching a parallel section with the llvm compiler.

Compile options are:
pgf90 file.f -c -fast -Mfixed -mp

This isn’t a problem as long as I remember to link to the non-llvm compiler after upgrading, but I would like to understand the reason for the difference.

Thank you for continuing to make this software available to the community.

Eric

Hi Eric,

Is this a known difference between the two?

No, it’s not expected and is something we’d consider a bug.

It almost seems as if there is more thread overhead when launching a parallel section with the llvm compiler.

The non-LLVM compiler uses our older runtime, which inlined OpenMP regions to reduce overhead. The 19.10 LLVM compiler uses the KOMP runtime, which outlines regions (creates a call per region). There is a bit more overhead with outlining, but I doubt that would account for much of the difference.

We’re actually transitioning to a new OpenMP runtime, NVOMP, which you might try by using the flag “-mp=nvomp”. I don’t believe we’ve documented it in 19.10 since it was early access, though it became the default OpenMP runtime in the 20.1 release. If you don’t see equivalent performance, let’s wait until the next Community Edition (it should be next month, though our schedule has been delayed due to COVID-19), and if it’s still not meeting the expected performance, I’ll ask our Customer Service folks to reach out and see if we can get a reproducing example to give to the compiler team.

-Mat

We’re actually transitioning to a new OpenMP runtime, NVOMP, which you might try by using the flag “-mp=nvomp”.

Mat,

Huge difference with the -mp=nvomp switch. It now runs at least as fast as the non-llvm compiled version.

Thanks for the quick response and excellent support.

Eric

Unfortunately, I’ve found another issue with the llvm-compiled versions. I get incorrect/incomplete output from the model when I run those versions. I suspect a key array is getting zeroed or overwritten somehow, which trips a check that bypasses a large number of calculations and gives the appearance of speeding up the program. When I dump the intermediate results, I get a lot of “NaN” instead of floating-point values, and “****” where a float should be printed in exponential format.

In summary:
-llvm compiled serially = ok
-non-llvm compiled serially = ok
-llvm compiled with -mp = bad output and slow
-llvm compiled with -mp=nvomp = bad output and fast
-non-llvm compiled with -mp = ok

Perhaps it’s a similar issue to the one in this post, since I have a ton of private and shared arrays.
https://forums.developer.nvidia.com/t/error-using-openmp-with-community-edition-19-4/136112/1

I’m digging into it, but the code base is huge. I can share the code and data needed to reproduce the problem, but the input data is proprietary, so I can only do so via email.

Thanks,

Eric

Hi Eric,

I’ll ask our customer service folks to contact you on the gmail account to see about getting a reproducing example.

Thanks,
Mat

Mat,

I emailed the code over to them today.

Regards,

Eric

Thanks Eric. I was able to get the code and reproduce the problem.

It looks to be an issue with the “copyin” clause. For some reason, ICMAX isn’t getting updated for the second thread and has a value of 0. This causes the “k” loop to be skipped and COSLAT to be left uninitialized, which in turn causes a divide by zero and is the source of the NaNs.

                   if ( KSPHER > 0 ) then
                      do k = 1, ICMAX  ! Skipped since ICMAX==0
                         COSLAT(k) = cos(DEGRAD*(vert(vs(k))%attr(VERTY) + YOFFS))
                      enddo
                      do j = 1, 2
                       !! Divide by zero here since COSLAT(1) == 0.0
                         rdx(j) = rdx(j) / (COSLAT(1) * LENDEG)
                         rdy(j) = rdy(j) / LENDEG
                      enddo
                   endif

I’ve reported the issue as TPR #28356.

As a work-around, if I set ICMAX to 3 within the parallel region, the code seems to generate reasonable results (I’m not certain they are correct, but no NaNs are produced).

    !$ tid = omp_get_thread_num()
    tid = tid + 1
#ifdef PGI_WORKAROUND
    ICMAX=3
#endif

-Mat

Mat,

Nice work. That appears to have been the culprit. I’m running an -mp=nvomp compiled binary now and will let you know if I come across any other strange behavior. So far, CPU usage, speed, and output all look normal.

Thanks,

Eric