Hi,
I’m experiencing a strange problem. I have ported a legacy Fortran code to run on GPUs using CUDA Fortran, with device global variables declared in a module and kernel loop directives. The speedup over the CPU version is great, and the results are reproducible: they do not change from one run to another for the same set of inputs.

To add more capabilities, I then declared some additional device global variables. However, merely adding the new variables to the common module makes the results irreproducible: for the same set of inputs, they differ from those of the original code and differ from one run to another. Commenting out the new declarations, or moving them to a different position in the module (after some other declarations), restores the results of the original code.

The symptoms look like an out-of-bounds memory access or a race condition, but compute-sanitizer with the memcheck and racecheck tools reports zero errors.

In the short term I have a way forward (moving the declarations to a different position in the module), but I am afraid the issue may pop up again in some other context. I would be grateful for any suggestions on how to debug this or ideas about what the cause might be. Thanks!