To use SIMD in my OpenCL code on CPU targets, I cast a float array pointer to a float8* and use the swizzle operators to group adjacent instructions into vector operations. See my changes
For example, the previous code segment (lines#100-104)
    #define FUN(x) (4.f*(x)*(1.f-(x)))
    t=FUN(t);
    t=FUN(t);
    t=FUN(t);
    t=FUN(t);
    t=FUN(t);
was replaced by (lines#100-101)
This change compiles and runs fine on Intel's and AMD's OpenCL implementations, but it fails on NVIDIA's: all t values become wildly large numbers. On Intel's OpenCL, I gained about 15% in speed thanks to SSE operations.
I know it probably should not affect the speed on NVIDIA's OpenCL because there are no vector registers. Still, I expect the swizzle syntax above to be valid on NVIDIA and to produce the correct values.
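To narrow down which values go wrong, it may help to compare the device output against a host-side scalar reference. The following is only a minimal plain-C sketch, not the kernel code itself: the helper names `iterate_scalar`, `iterate_lanes8`, and `count_mismatches` are hypothetical, the five-step count is taken from the scalar segment above, and the 8-lane loop is a plain-C stand-in for a float8 update rather than real OpenCL vector code.

```c
#include <assert.h>
#include <stddef.h>

/* Same map as the kernel macro: one logistic-map step. */
#define FUN(x) (4.f*(x)*(1.f-(x)))

/* Hypothetical helper: scalar reference, five FUN steps on one value. */
float iterate_scalar(float t)
{
    for (int step = 0; step < 5; ++step)
        t = FUN(t);
    return t;
}

/* Hypothetical helper: plain-C stand-in for a float8 update.
   Each group of 8 adjacent floats is advanced one step at a time,
   lane by lane. n must be a multiple of 8. */
void iterate_lanes8(float *t, size_t n)
{
    assert(n % 8 == 0);
    for (size_t base = 0; base < n; base += 8)
        for (int step = 0; step < 5; ++step)
            for (int lane = 0; lane < 8; ++lane)
                t[base + lane] = FUN(t[base + lane]);
}

/* Hypothetical helper: counts entries of got[] that differ from the
   scalar reference computed from start[] by more than tol.
   NaN entries are counted as mismatches. */
size_t count_mismatches(const float *start, const float *got,
                        size_t n, float tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        float d = got[i] - iterate_scalar(start[i]);
        if (d < 0.f)
            d = -d;
        if (!(d <= tol))
            ++bad;
    }
    return bad;
}
```

In use, `start` would hold the initial t values and `got` the buffer read back from the device; a nonzero return from `count_mismatches` on one platform but not the others points at the vectorized path rather than the math.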
Can someone tell me if you see anything wrong with the change above?