I would like to add a word of caution about establishing the correctness of GPU results simply by comparing them to CPU results computed at the same precision. Obviously there will be a certain amount of error in the CPU results as well, and a simple comparison of the GPU and CPU results does not establish how much of the total difference should be attributed to the error of each platform.
I have handled several reports of “incorrect” GPU results where it turned out that the fairly large differences between CPU and GPU results were due to accumulated error on the CPU side, which was larger than the error on the GPU side. I found that most such scenarios could be traced back to two mechanisms:
(1) The use of FMA (fused multiply-add) on the GPU. This reduces overall rounding error and can mitigate the effects of subtractive cancellation (see the first sketch after this list).
(2) The use of summation via a tree-like reduction on the GPU, which tends to add quantities of similar magnitude at each step (see the second sketch after this list).
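
To illustrate mechanism (1), here is a minimal host-side C++ sketch using fma() from <cmath>, which performs the same single-rounding operation as the GPU's FMA instruction. The operand values are my own contrived example, chosen so that the product must be rounded:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Contrived operands (C++17 hex-float literals): the exact product
    // 1 + 2^-26 + 2^-54 does not fit into a double and must be rounded.
    double a = 1.0 + 0x1.0p-27;
    double b = 1.0 + 0x1.0p-27;
    double p = a * b;              // separately rounded product

    // fma() evaluates a*b - p with the product held exactly and a single
    // rounding at the end, so r is the exact rounding error of p above.
    double r = std::fma(a, b, -p);

    // The unfused expression rounds a*b to p first, so the subtraction
    // cancels to zero and the rounding error is invisible. (Compile with
    // -ffp-contract=off so the compiler does not fuse this itself.)
    double q = a * b - p;

    std::printf("fused   residual: %.17g\n", r);  // 2^-54
    std::printf("unfused residual: %.17g\n", q);  // 0
    return 0;
}
```

That single-rounding property is what allows FMA-based computations to retain bits that a separately rounded multiply followed by an add would lose to cancellation.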
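To illustrate mechanism (2), here is a sketch comparing a plain sequential float summation, as a naive CPU loop would perform it, with a pairwise (tree-like) summation in the spirit of a GPU reduction. The input data is assumed for illustration:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Sequential left-to-right summation: the accumulator quickly grows much
// larger than the addends, so low-order bits of the addends are lost.
float sum_sequential(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Pairwise (tree-like) summation: operands at each level of the tree have
// similar magnitude, so the error grows like O(log n) rather than O(n).
float sum_pairwise(const float* v, std::size_t n) {
    if (n == 1) return v[0];
    std::size_t h = n / 2;
    return sum_pairwise(v, h) + sum_pairwise(v + h, n - h);
}

int main() {
    const std::size_t n = std::size_t(1) << 25;  // 32M elements (~128 MB)
    std::vector<float> v(n, 1.0f);               // exact sum is 33554432

    float seq = sum_sequential(v);
    float pw  = sum_pairwise(v.data(), n);

    std::printf("sequential: %.1f\n", seq);  // stalls at 16777216
    std::printf("pairwise  : %.1f\n", pw);   // exact: 33554432
    return 0;
}
```

Note that this comparison only holds if the compiler preserves the evaluation order, so value-unsafe optimizations such as -ffast-math must be off; ironically, such flags can make the "sequential" loop partially tree-like through vectorization.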
I consider comparison with a high-precision reference (where all intermediate computation is performed in double-double arithmetic, or with a multiple-precision library) the final arbiter as to which set of results is the more accurate one, and the appropriate way of establishing the actual error for a given platform.
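
As a sketch of what such a reference looks like, here is a minimal double-double accumulator built on Knuth's TwoSum error-free transformation. This is not production code (a real reference would use a complete double-double or multiple-precision library, and TwoSum requires strict IEEE evaluation, so again no -ffast-math), and the test data is made up for illustration:

```cpp
#include <cstdio>
#include <vector>

// Unevaluated pair hi + lo with |lo| <= ulp(hi)/2: roughly twice the
// precision of a plain double.
struct dd { double hi, lo; };

// Knuth's TwoSum: s + err == a + b exactly (error-free transformation).
static dd two_sum(double a, double b) {
    double s   = a + b;
    double bb  = s - a;
    double err = (a - (s - bb)) + (b - bb);
    return { s, err };
}

// Add a plain double to a double-double accumulator, then renormalize
// so that hi again carries the leading bits.
static dd dd_add(dd a, double b) {
    dd s = two_sum(a.hi, b);
    s.lo += a.lo;
    return two_sum(s.hi, s.lo);
}

int main() {
    // Contrived ill-conditioned sum: a large term, a million small terms,
    // then cancellation of the large term. The true sum is 1000000.
    std::vector<double> v;
    v.push_back(1.0e16);
    for (int i = 0; i < 1000000; ++i) v.push_back(1.0);
    v.push_back(-1.0e16);

    double plain = 0.0;
    dd ref = { 0.0, 0.0 };
    for (double x : v) { plain += x; ref = dd_add(ref, x); }

    std::printf("plain double : %.17g\n", plain);   // 0: the 1.0s are lost
    std::printf("double-double: %.17g\n", ref.hi);  // 1000000
    return 0;
}
```

With such a reference in hand, one can compute the actual error of the CPU result and the GPU result separately, instead of merely observing that the two disagree.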