Getting different results with pgfortran vs. gfortran for the exact same problem

In a computational fluid dynamics solver I have written, I am getting different results depending on the compiler I use. When I compile the code with gfortran using the flags -ffree-line-length-512 -Ofast -march=native, I get the “correct” results. However, when I run the same code compiled with pgfortran using the flags -fast -acc -Minfo=accel -ta=tesla:managed, the boundaries of my domain become incorrect.

To test whether this could be a compiler issue, I commented out all OpenACC directives and compiled the code with pgfortran and the flags above. I still got the same incorrect results. I then removed all flags and compiled with pgfortran alone, and the results were still incorrect.

I am not sure what the issue is, or even whether this really is a compiler issue. With exactly the same code, just different compilers, I am getting considerably different results at my domain boundaries, and I am not sure what to do.

I appreciate any advice, and if there is any other information I can provide, I would be happy to do so.

There may be many reasons. I would check:

  1. uninitialized variables / out-of-bounds access
  2. print the min/max of your arrays after each routine to see where the difference first shows up (see the sketch after this list)
  3. try lowering the optimization level (start at -O0, then increase to -O1, -O2, and -O3)
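For point 2, a minimal sketch of the kind of check I mean (the module, array, and routine names are just placeholders, not anything from your code):

```fortran
module debug_minmax
  implicit none
contains
  ! Print the min/max of an array with a label so you can see, routine by
  ! routine, where the gfortran and pgfortran builds first diverge.
  subroutine dump_minmax(label, a)
    character(len=*), intent(in) :: label
    real(8), intent(in) :: a(:,:)
    print '(a,2es24.16)', label//' min/max:', minval(a), maxval(a)
  end subroutine dump_minmax
end module debug_minmax
```

Then call it after each stage, e.g. `call dump_minmax('rho after apply_bc', rho)`, with whatever your arrays are actually called.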

I would try to get correct results on the CPU first (and there is no need to remove OpenACC: if you don’t compile with -acc, the directives are simply ignored).

To add to Mass’ answer.

If the difference is large, it’s most likely an uninitialized variable or an out-of-bounds access. nvfortran doesn’t implicitly initialize variables to zero as some other compilers do. You can try adding “-Msave”, which has the compiler apply the SAVE attribute to all variables and also has the side effect of initializing global (module) variables to zero. Note that I find the Valgrind (www.valgrind.org) utility very helpful for finding uninitialized variables and out-of-bounds access errors.
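As a purely hypothetical illustration of the kind of bug this catches (none of these names come from the original code), a local variable that is never assigned may happen to start at zero under one compiler and contain garbage under another:

```fortran
! Hypothetical example of an uninitialized-variable bug: "accum" is read
! before it is ever assigned, so its starting value depends on the compiler
! and on whatever happens to be in memory. Some compilers zero it; nvfortran
! does not. -Msave or Valgrind can help expose cases like this.
program uninit_demo
  implicit none
  real(8) :: accum
  integer :: i
  do i = 1, 10
     accum = accum + 1.0d0   ! bug: accum was never initialized to 0
  end do
  print *, 'accum =', accum  ! only 10.0 if accum happened to start at 0
end program uninit_demo
```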

If the results are close but not exact, look to instruction choice and optimization level.

We use FMA (fused multiply-add) on systems which support the instruction. While it is more precise than the equivalent multiply followed by an add, since there is less rounding error, it can cause numerical differences. Try adding “-Mnofma” to disable it.
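As a rough sketch of where this matters (not taken from the original code), a multiply feeding straight into an add is a candidate for contraction into a single FMA:

```fortran
! Sketch of an expression where FMA contraction can change the last bits:
! with FMA, a*b + c is rounded once; without it, a*b is rounded first and
! the add is rounded again. Both results are legitimate, but they can
! differ, which is what -Mnofma (or gfortran's -ffp-contract=off) controls.
program fma_demo
  implicit none
  real(8) :: a, b, c, d
  a = 1.0d0 + 2.0d0**(-30)
  b = 1.0d0 - 2.0d0**(-30)
  c = -1.0d0
  d = a*b + c               ! candidate for contraction into one FMA
  print '(es24.16)', d      ! with FMA: about -8.7e-19; without: exactly 0.0
end program fma_demo
```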

There’s also the “-Kieee” flag, which has the compiler adhere strictly to IEEE 754. That said, with the exception of the “-Mfprelaxed” flag (not enabled by default), even without -Kieee we’re usually only 1-2 ULPs off, so quite accurate.

Also keep in mind that applying parallelism will affect accuracy as well, since the order of operations, and therefore the rounding error, changes. Some small variance is to be expected between compilers, optimization levels, and architectures. It’s best to allow for some tolerance rather than seek bit-for-bit exactness.
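In practice that means comparing fields with a tolerance, something along these lines (a sketch only; the relative tolerance you pass in depends on the solver and the test case):

```fortran
! Sketch of a tolerance-based comparison between two solutions instead of
! expecting bit-for-bit identical output.
module compare_fields
  implicit none
contains
  ! True if the largest pointwise difference, relative to the largest
  ! magnitude present in either field, is within rel_tol.
  logical function fields_match(a, b, rel_tol)
    real(8), intent(in) :: a(:,:), b(:,:)
    real(8), intent(in) :: rel_tol
    real(8) :: denom
    denom = max(maxval(abs(a)), maxval(abs(b)), tiny(1.0d0))
    fields_match = maxval(abs(a - b))/denom <= rel_tol
  end function fields_match
end module compare_fields
```

For example, `fields_match(rho_cpu, rho_gpu, 1.0d-12)` with placeholder array names for the two runs.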

Thanks everyone for your comments and suggestions, I really appreciate them all.

I figured out what the issue was. To be clear, when I was comparing compiler vs. compiler I was not using any flags for either, to make sure it was not an optimization issue or something else.

Essentially, the error came from my boundary conditions, for which I use ghost cells. In the isothermal wall boundary condition, I was computing the temperature, setting a zero gradient condition for the pressure, and then computing the density from temperature and pressure.

These boundary conditions were being set as:

T_g = 2*T_b - T_i

p_g = p_i

rho_g = gamma*p_g/T_g

where the subscripts g, b, i denote the ghost, boundary, and interior cells, respectively. Executed sequentially, this produces the intended result, since p_g and T_g are set before rho_g and can therefore be used to compute it. Somehow this was not happening in that order with pgfortran, so p_g and T_g were not set before the density was computed. I have no idea how that makes sense, since without the -acc flag the code should just run sequentially. However, modifying the density boundary condition to:

rho_g = gamma*p_i/T_i

fixed the issue.
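For anyone who runs into something similar, here is a sketch of what the corrected ghost-cell update looks like along one boundary (all names are placeholders, not the actual variables from my solver):

```fortran
! Sketch of the corrected isothermal-wall ghost-cell update. Index 0 is the
! ghost cell and 1 the first interior cell; T_wall is the prescribed wall
! temperature and gam the ratio of specific heats. All names are
! placeholders for the actual variables in the solver.
subroutine isothermal_wall_bc(T, p, rho, ni, T_wall, gam)
  implicit none
  integer, intent(in) :: ni
  real(8), intent(inout) :: T(0:ni+1), p(0:ni+1), rho(0:ni+1)
  real(8), intent(in) :: T_wall, gam

  T(0)   = 2.0d0*T_wall - T(1)   ! T_g = 2*T_b - T_i
  p(0)   = p(1)                  ! zero-gradient pressure: p_g = p_i
  rho(0) = gam*p(1)/T(1)         ! the fix: rho_g from the interior p_i and
                                 ! T_i, not from the ghost values set above
end subroutine isothermal_wall_bc
```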