It’s not unusual to see slight differences. Are they within an acceptable tolerance?
There are many things that can cause numerical differences in floating-point results, but my guess here is that it’s order of operations and rounding error. Operations such as reductions or atomics, in particular, change the order in which the additions are applied. Because each floating-point operation rounds its result, changing the order gives slightly different answers. With a relatively small number of threads the difference may go unnoticed, but going massively parallel tends to expose it.
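You can see this order-of-operations effect even without any parallelism. A minimal sketch (ordinary Python doubles, not tied to any particular code in this thread): adding a small value to a large one can round away entirely, so the same three numbers summed in two different orders give two different totals — exactly what happens when a parallel reduction regroups the additions.

```python
big = 1e16
small = 1.0

# Left to right: big + 1.0 rounds back to big (ULP at 1e16 is 2.0), so the 1.0 is lost.
left_to_right = (big + small) - big

# Reordered: big cancels first, then the 1.0 survives.
reordered = (big - big) + small

print(left_to_right)  # 0.0
print(reordered)      # 1.0
```

Mathematically both expressions equal 1, but in floating point the grouping decides which rounding errors occur. A parallel reduction is effectively choosing a different (and often non-deterministic) grouping on every run.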
Other things that can cause differences include math intrinsics (sin, cos, etc.), FMA operations, the data type’s precision (float vs. double), the data itself (very large or very small numbers may not be exactly representable), and the compiler optimizations being applied, among others.
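Two of those factors are easy to demonstrate directly. A quick sketch (the `as_float32` helper is just an illustration, using `struct` to round a Python double to single precision and back): the same decimal value stores differently in float and double, and a large magnitude swallows small additions entirely.

```python
import struct

def as_float32(x):
    """Round a Python float (a double) through single precision and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Precision: 0.1 has no exact binary representation, and single
# precision keeps fewer bits of the approximation than double does.
print(repr(0.1))              # 0.1
print(repr(as_float32(0.1)))  # 0.10000000149011612

# Magnitude: above 2^53, doubles can no longer represent every integer,
# so adding 1.0 changes nothing.
print(2.0**53 + 1.0 == 2.0**53)  # True
```

So even with identical source code, feeding the same algorithm float data instead of double data, or data at a very different scale, will shift the results.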
Some of these we can control (such as the optimization level, or whether FMA contraction is enabled), but differences due to the order of operations in parallel code can’t be helped.