The posted code does not appear to be complete, so by necessity the following discussion will have to remain somewhat speculative. I assume it has been established that the inputs to the function shown are identical in the CPU and GPU versions, and that data corruption (e.g. due to out-of-bounds accesses, race conditions) has been ruled out both by code inspection and testing.
It is not clear to me how sensitive the computation is to small numerical differences, but there are multiple reasons why intermediate GPU and CPU results may be slightly different. This in turn may cause differences in the output if such sensitivity exists. BTW, is this running on an sm_1x or sm_2x platform?
(1) On sm_1x platforms, and on sm_2x platforms with -ftz=true, very small numbers (so-called denormals) are flushed to zero when running on the GPU. While such an FTZ mode also exists in SSE, it is unlikely to be turned on by default, and the host compiler may not even target SSE. In my experience this is an infrequent issue; with a few exceptions, code is not usually prone to frequent underflows.
(2) On sm_1x platforms, and on sm_2x platforms with -prec-div=false, the following results in an approximate, not IEEE-rounded reciprocal:
d = 1.0f / d;
On the CPU, an IEEE-rounded reciprocal would likely be generated here. For the sake of experiment, you could replace this as follows to force a reciprocal with IEEE round-to-nearest-or-even on the GPU:
d = __frcp_rn(d);
(3) There are multiple opportunities in this code for the compiler to generate either FMADs (on sm_1x), or FFMAs (on sm_2x), which are merged multiply-and-add operations. For example, the following expression naturally lends itself to the use of FMAD/FMA:
vec3 color_persp = a*v0.color + b*v1.color + c*v2.color;
The results from either FMAD or FFMA are not generally identical to a sequence of FMUL followed by FADD, as would be used on the CPU. FFMA on sm_2x follows the IEEE-754 (2008) specification for fused multiply-add and will in general provide better accuracy than the combination of FMUL and FADD. To force the use of separate FMUL and FADD, one can use the __fmul_rn(), __fadd_rn() device functions, the use of which prevents the compiler from merging these operations into FMAD/FFMA.
(4) Even though the code uses float variables throughout, depending on how it is compiled on the CPU, much of the CPU computation may actually be performed at higher precision. In particular this would likely be the case if the compiler targets the x87 FPU [i.e. not SSE], in which case most floating-point operations would map to in-register computation that is performed with double-precision (on Windows) or extended-precision (on Linux).
While there are host compiler flags designed to reduce “excess precision” being carried in temporary variables (I think /Op for MSVC, -ffloat-store for gcc), the only way to test that hypothesis rigorously on the CPU is to declare all float variables volatile, and also break up the computations into individual operations. For example, this
float z_affine = a*v0.position.z + b*v1.position.z + c*v2.position.z;
would be turned into this:
volatile float z_affine;
z_affine = a*v0.position.z;
z_affine += b*v1.position.z;
z_affine += c*v2.position.z;
In my personal experience, significant numerical differences observed between CPU and GPU computation on floats are often due to issue (4), but combinations of multiple of the issues mentioned above are also not uncommon.