Emulation/CPU = correct, Execution/GPU = incorrect

Results from both the parallel machine and the C870 agree up to and including 6 warps, but from 7 warps onwards the results from the C870 show unexpected behaviour while the parallel machine results are as expected.

There is also an unexpected performance gain at 7 or more warps: the execution time drops dramatically from seconds to fractions of a millisecond.

Any suggestions?

Check for errors after your kernel launch; you probably use too many registers for 7 warps per block. I think you are experiencing a "too many resources requested for launch" error.

I’m beginning to suspect it could be something like that.

How do I check for the error you mention? Following the example projects, I have this in my code:

////////////////////////////////////

// Launch the device computation

SPH<<<dimGrid, dimBlock>>>(/* variables */);

CUT_CHECK_ERROR("kernel error");

////////////////////////////////////

Do I need to write something else?

Try calling cudaThreadSynchronize() before checking for the error value. (Actually, cudaThreadSynchronize() will itself report the error if there is one.)
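As a sketch of that pattern with the runtime API (your `SPH` kernel and launch configuration standing in for the real code), it could look something like this:

```cuda
// Launch the device computation
SPH<<<dimGrid, dimBlock>>>(/* variables */);

// The launch itself is asynchronous; a configuration problem such as
// "too many resources requested for launch" shows up here.
cudaError_t err = cudaGetLastError();

// Wait for the kernel to finish; errors that occur during execution
// are returned by the synchronization call itself.
if (err == cudaSuccess)
    err = cudaThreadSynchronize();

if (err != cudaSuccess)
    printf("kernel error: %s\n", cudaGetErrorString(err));
```

With too many registers per thread, the `cudaGetLastError()` right after the launch is the call that should catch it, since the kernel never actually starts.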

I just tripped over the fact that __sinf only works within [-pi, pi] whereas in Emu mode it works correctly at all times. Similarly with __cosf, I suppose. This screwed over my Hough Transform. ;)

Don’t forget to run in Debug mode. Check the custom build step of the sample in case anything doesn’t work.

More precisely, the absolute error of __sinf and __cosf is only guaranteed within [-pi, pi]. Outside this range, it will be less accurate, but still usable for most purposes (well, maybe not yours!)

You can think of __sinf(x) as behaving much like sinf(fmodf(x, (float)TWO_PI)). Only for very large inputs will it give a completely wrong answer in absolute terms.

But even inside [-pi, pi], CPU implementations can be much more accurate than __sinf and __cosf, if "accurate" means a smaller relative error (think of __sinf(x) when x is very close to 0).