Results from both the parallel machine and the C870 agree up to and including 6 warps, but from 7 warps onwards the results from the C870 show unexpected behaviour while the parallel machine results are as expected.
There is also an unexpected performance gain at 7 or more warps: the execution time drops dramatically, from seconds to fractions of a millisecond.
Check for errors after your kernel launch; you probably have too many registers for 7 warps per block. I think you are getting a "too many resources requested" error.
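For reference, here is a minimal sketch of what that check could look like. The kernel `myKernel`, the helper `launchAndCheck` and the grid size are placeholders I made up, not the original poster's code, and `cudaDeviceSynchronize` is the current runtime call (toolkits from the C870 era used `cudaThreadSynchronize`). A launch that fails returns almost immediately without running the kernel, which would also be consistent with the sudden drop in execution time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one.
__global__ void myKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * i;
}

// Launches myKernel with the given number of warps per block and
// returns the first error encountered (cudaSuccess if none).
cudaError_t launchAndCheck(int warpsPerBlock)
{
    const int threads = warpsPerBlock * 32;  // 32 threads per warp
    const int blocks = 16;                   // arbitrary grid size for the sketch
    float *d_out = NULL;

    cudaError_t err = cudaMalloc((void **)&d_out,
                                 (size_t)blocks * threads * sizeof(float));
    if (err != cudaSuccess) return err;

    myKernel<<<blocks, threads>>>(d_out);

    // Launch-time problems (bad configuration, too many resources
    // requested) are reported here, before the kernel runs at all.
    err = cudaGetLastError();

    // Errors raised while the kernel executes only show up after a sync.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaFree(d_out);
    return err;
}

int main()
{
    // Registers per thread and the largest block size this kernel can use.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, myKernel) == cudaSuccess)
        printf("registers/thread: %d, max threads/block: %d\n",
               attr.numRegs, attr.maxThreadsPerBlock);

    for (int warps = 6; warps <= 7; ++warps) {
        cudaError_t err = launchAndCheck(warps);
        printf("%d warps per block: %s\n", warps, cudaGetErrorString(err));
    }
    return 0;
}
```

If `maxThreadsPerBlock` for your real kernel comes out below 7 * 32 = 224, that would confirm the register theory. Compiling with `nvcc --ptxas-options=-v` also prints the register count per thread.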
I just tripped over the fact that __sinf only works within [-pi, pi] whereas in Emu mode it works correctly at all times. Similarly with __cosf, I suppose. This screwed over my Hough Transform. ;)
More precisely, the absolute error of __sinf and __cosf is only guaranteed within [-pi, pi]. Outside this range, it will be less accurate, but still usable for most purposes (well, maybe not yours!)
You can think of __sinf(x) as behaving much like sinf(fmodf(x, (float)TWO_PI)). Only for very large inputs will it give a completely wrong answer in absolute terms.
But even inside [-pi, pi], CPU implementations can be much more accurate than __sinf and __cosf, if accurate means a smaller relative error (think __sinf(x) when x is very close to 0).
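If you want to see the difference on your own card, something like the following sketch would do. The kernel `sinCompare`, the helper `reduceToPi` and the test inputs are my own choices for illustration. The explicit reduction to [-pi, pi] is only one possible workaround, and since it is done in single precision it loses accuracy itself once |x| gets large. Note also that compiling with `-use_fast_math` turns `sinf` into `__sinf`, which would make the two columns identical.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define TWO_PI 6.283185307179586f

// Hypothetical helper: fold x into roughly [-pi, pi] before calling __sinf.
// Done in float, so the reduction itself gets less accurate for large |x|.
__host__ __device__ inline float reduceToPi(float x)
{
    return x - TWO_PI * rintf(x / TWO_PI);
}

// Evaluates the fast intrinsic, the intrinsic after explicit range
// reduction, and the regular sinf for each input.
__global__ void sinCompare(const float *x, float *fast, float *reduced,
                           float *accurate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]     = __sinf(x[i]);
        reduced[i]  = __sinf(reduceToPi(x[i]));
        accurate[i] = sinf(x[i]);
    }
}

int main()
{
    const int n = 4;
    // Arbitrary inputs: near zero, inside [-pi, pi], moderately large, very large.
    const float h_x[n] = {1.0e-3f, 1.0f, 100.0f, 1.0e6f};
    float h_fast[n], h_red[n], h_acc[n];

    float *d_x, *d_fast, *d_red, *d_acc;
    cudaMalloc((void **)&d_x,    n * sizeof(float));
    cudaMalloc((void **)&d_fast, n * sizeof(float));
    cudaMalloc((void **)&d_red,  n * sizeof(float));
    cudaMalloc((void **)&d_acc,  n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    sinCompare<<<1, n>>>(d_x, d_fast, d_red, d_acc, n);

    // These blocking copies also wait for the kernel to finish.
    cudaMemcpy(h_fast, d_fast, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_red,  d_red,  n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_acc,  d_acc,  n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) {
        double ref = sin((double)h_x[i]);  // double-precision CPU reference
        printf("x=%g  |__sinf err|=%g  |reduced err|=%g  |sinf err|=%g\n",
               h_x[i], fabs(h_fast[i] - ref), fabs(h_red[i] - ref),
               fabs(h_acc[i] - ref));
    }

    cudaFree(d_x); cudaFree(d_fast); cudaFree(d_red); cudaFree(d_acc);
    return 0;
}
```

Dividing each error by |sin(x)| gives the relative error, which is the figure to look at for inputs close to 0.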