Numerical Reproducibility & Randomness


I have a question about numerical reproducibility. This is just out of curiosity.

I parallelized a particle-based numerical code (Smoothed Particle Hydrodynamics) using MPI and OpenACC. Overall, I think it has desirable numerical accuracy with double-precision variables. However, I found the code's behavior somewhat interesting when I was debugging it and tracking the exact locations of the particles (x, y, z coordinate variables). The exact particle locations were no longer identical after a certain number of timesteps (say, 10,000 iterations) across different computing platforms.

  1. Pure MPI (CPU-only) case: The exact locations of the particles were reproducible only if I used the same CPU architecture with the same number of MPI cores. If a different number of cores or another type of CPU was used, the results were not reproducible, though there was no randomness (multiple trials gave me the same results).

  2. GPU case: With the OpenACC flag turned on, the exact locations of the particles were different on each trial; thus, there is some randomness.

It seems the number of cores (either CPU or GPU cores) affects reproducibility. Why is this happening? Is it something related to round-off error? But CPUs and GPUs behave somewhat differently. Your insights will be deeply appreciated!

Thank you,

I can’t be certain what’s happening in your program, but floating-point rounding error could be a factor, and the order of operations when run in parallel is also a likely contributor.

For the CPU case, it’s not uncommon to see slightly different floating-point values depending on the architecture and which optimizations were applied. You can try adding the flag “-Kieee”, which tells the compiler to use strict compliance with IEEE 754. It does limit the optimizations that can be applied, so it often results in slower code, but it should give better consistency across architectures.

Also, if you’re using single precision, consider switching to double precision, which will help mitigate the rounding error.

However, it won’t help with parallelism, since the order of operations can change. With different numbers of MPI ranks, you’ll have different domain decompositions, which can change the order in which the operations are computed, which in turn changes the rounding error.

For the GPU, instead of a few threads there can be thousands, if not millions, of threads, and the order in which the threads execute is non-deterministic. Hence the rounding error due to the order of operations can be more of a factor.

