I have code which produces identical output when run multiple times under the same version of CUDA. After upgrading from CUDA 8 to CUDA 9, I obtain results which differ by order 0.1% after around 10^5 iterations of my algorithm. Since I'm using double precision, I would naively expect differences no larger than about 10^-11 at that point. The initial conditions used are identical to machine precision across CUDA versions.
Without knowing the algorithm and the structure of the code, it’s quite difficult to say.
Code that depends on many floating-point operations running in separate blocks, with the partial results then accumulated together (e.g. a large parallel reduction), can show variability well above ULP-level error, depending on execution order and the actual data.
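To see why accumulation order alone can move a result by more than one ULP, here is a minimal illustration (not from the original thread) of the fact that IEEE floating-point addition is not associative:

```python
# FP addition is not associative: regrouping the same three
# values changes the rounded result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(left == right)  # False
```

A compiler or scheduler that changes the grouping of such sums (as a parallel reduction inherently does) produces a different, equally valid rounding of the same mathematical expression.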
It’s entirely possible for block scheduling order to change between toolkit versions, or even between runs; it is not specified.
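As a hedged sketch of how scheduling order feeds into this: suppose each block emits a partial sum, and the final accumulation happens in whatever order the blocks complete. The partial values below are contrived to make the effect obvious, but the mechanism is the same for real data:

```python
# Hypothetical partial sums, one per "block". The wide spread of
# magnitudes exaggerates the rounding so the order effect is visible.
partials = [1e16, 1.0, -1e16, 1.0]

def accumulate(order):
    """Sum the partials in the given (scheduling-dependent) order."""
    total = 0.0
    for i in order:
        total += partials[i]
    return total

print(accumulate([0, 1, 2, 3]))  # 1.0  (the 1.0 added to 1e16 is lost)
print(accumulate([1, 3, 0, 2]))  # 2.0  (small terms combine first)
```

Two runs of the same binary that merely finish blocks in a different order can therefore disagree, and the size of the disagreement depends on the data, not on ULP arithmetic.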
0.1% does seem large, though, so it may be that your code has defects, and it may also be that CUDA 9 has defects.
Yeah, I figured. Based on the results I'd seen, I assumed my code was robust to run-time floating-point variation, but I'm unsure whether that also constrains the variation a compiler change could introduce. I've actually been surprised to see identical output from run to run (under the same CUDA toolkit version), since my understanding is that there's no reason for the same executable to perform operations in an identical order from one execution to the next. Currently going through the painful process of trying to diagnose possible issues…