Basically, I have a function that uses 75 registers. If I compile the code with --maxrregcount=70, the final result is the same as without specifying a maximum number of registers. But if I use --maxrregcount=60, the result is different, say 0.657062 instead of 1.309587… Can this be caused by something other than a CUDA bug?
Operation order may change with register optimization. Since fp arithmetic is not associative due to finite precision, this may affect the result.
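To make the non-associativity point concrete, here is a minimal host-side C++ sketch (not taken from the original code; the function names are made up for illustration). It adds the same three floats in two different groupings. The volatile temporaries are only there to pin the evaluation order so the compiler cannot regroup the additions itself.

```cpp
// Hypothetical helpers: same three operands, two different groupings.

// Computes (a + b) + c.
float sum_left(float a, float b, float c) {
    volatile float t = a + b;  // force the intermediate to be rounded to float
    return t + c;
}

// Computes a + (b + c).
float sum_right(float a, float b, float c) {
    volatile float t = b + c;  // force the intermediate to be rounded to float
    return a + t;
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, sum_left gives 1 and sum_right gives 0, because 1.0f is smaller than the spacing between adjacent floats near 1e8 and is lost when added to -1e8f first. A compiler that reorders a long expression to fit a tighter register budget can in principle produce the same kind of difference, although how large it gets depends on how much cancellation the computation involves.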
Is there a way for you to test the code with integer-valued inputs? That is, all arithmetic is still fp, but every intermediate and final result should be an integer that fp can represent exactly. This would test whether reordering of mathematically associative operations is the cause of the different results, since exact integer results do not depend on the order. If it’s incorrect even for integer results, then it’s a bug you should file (with a stripped-down version of the source reproducing it).
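A rough sketch of that integer-valued test, again as plain host-side C++ with made-up names (make_integer_inputs, sum_forward and sum_backward are not from the original code). It assumes every partial sum stays well below 2^24, the range in which a float represents integers exactly, so the summation order cannot change the result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical input generator: small integer values in [-3, 3], stored as float.
std::vector<float> make_integer_inputs(std::size_t n) {
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = static_cast<float>(static_cast<int>(i % 7) - 3);
    }
    return v;
}

// Sum from the first element to the last.
float sum_forward(const std::vector<float>& v) {
    float s = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Sum from the last element to the first (a different operation order).
float sum_backward(const std::vector<float>& v) {
    float s = 0.0f;
    for (std::size_t i = v.size(); i-- > 0;) s += v[i];
    return s;
}
```

For inputs like these, sum_forward and sum_backward must agree bit for bit. The same idea carries over to the kernel: feed it integer-valued data, build it with and without --maxrregcount=60, and compare. If the two builds still disagree, reordering is not the explanation.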
Also, if you have the same code for CPU, try compiling it with different fp optimization flags and check the range of results for the same input.