The CPU may indeed execute instructions out of order, but only when they are independent. As long as the assembly is correct, we always get the correct result.
Also, the sole purpose of all this concern about operation order is usually just to improve stability without losing too much performance. For example, when one translates a geometric primitive and its bounding box into another coordinate system, one usually wants the bounding box to remain valid. If the operations are carried out exactly as the programmer wrote them, this is always true. But when the program meets an over-enthusiastic optimizing compiler, after some loop unrolling, *+-to-MAD contraction, CSE and so on, this may no longer hold. One would have to recompute the bounding box or add some annoying epsilon to keep it valid. Even worse, when the vertices of a single primitive are translated in slightly different ways, the primitive may become degenerate, and that would force us to put uber-costly degeneracy checks everywhere. I don't think this falls into the category of "not stable to be practical".
That's what I once encountered in nvcc (I did the loop unrolling by hand; otherwise nvcc stores everything in local memory). Currently I'm able to get around this using volatile shared memory (nvcc still can't compile my program with -Xopencc -O0), but I'd like to be sure whether I'm going to hit similar problems in ptxas, or in the driver (not very likely, I guess, but I still want to be sure).
I know compiler guys really love aggressive optimizations, but the first UIUC CUDA course did mention the importance of correctness, right?
By the way, what do you mean by presorting?