After developing my first kernel, I compared its results with the ones generated on the host CPU.
My kernel is called three times (the second call uses data produced by the first, and the third call uses data produced by the second).
Here’s what I got:
1st kernel call → 972 errors greater than: 0.000000000000001 | biggest error: 0.000000119209290
2nd kernel call → 575 errors greater than: 0.000000000000001 | biggest error: 0.000000178813934
3rd kernel call → 75 errors greater than: 0.000000000000001 | biggest error: 0.000000074505806
It’s not like I wasn’t expecting those differences (in fact, I was). So far so good.
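For reference, my host-side check is essentially along these lines (a minimal sketch; the function and variable names here are made up for illustration, not taken from any particular framework):

```cpp
#include <cmath>
#include <cstdio>

// Compare device results against a CPU reference: count how many elements
// differ by more than a threshold and track the largest difference seen.
void compareResults(const float* gpuResult, const float* cpuResult, int n,
                    double threshold)
{
    int errorCount = 0;
    double biggestError = 0.0;

    for (int i = 0; i < n; ++i) {
        double diff = std::fabs((double)gpuResult[i] - (double)cpuResult[i]);
        if (diff > threshold)
            ++errorCount;
        if (diff > biggestError)
            biggestError = diff;
    }

    printf("%d errors greater than: %.15f | biggest error: %.15f\n",
           errorCount, threshold, biggestError);
}
```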
Anyway, after I saw the Parallel Reduction example (and since my kernel performs a data reduction), I decided to “halve the number of blocks, and replace single load” (see Reduction #4: First Add During Load).
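Roughly speaking, the change follows the pattern below (this is a sketch based on the SDK reduction example, not my actual kernel): each block covers twice as much input, and every thread adds two elements while loading into shared memory, so only half as many blocks are launched.

```cuda
__global__ void reduce4(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // First add during load: each thread combines two input elements.
    float sum = (i < n) ? g_in[i] : 0.0f;
    if (i + blockDim.x < n)
        sum += g_in[i + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction for the remaining values.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the block's partial sum.
    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}
```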
I did so and afterwards tested the results to make sure everything was OK. I’m pretty sure the computations are equivalent, because the results in both situations are fundamentally the same. However, I found it weird that the differences between the device and CPU calculations changed with this modification to the kernel:
1st kernel call → 1060 errors greater than: 0.000000000000001 | biggest error: 0.000000119209290
2nd kernel call → 630 errors greater than: 0.000000000000001 | biggest error: 0.000000238418579
3rd kernel call → 52 errors greater than: 0.000000000000001 | biggest error: 0.000000089406967
I wasn’t expecting such behavior. My guess is that this is happening due to compiler optimizations. Any other ideas?