My code doesn’t depend on the sm level; I could build it for sm10 if I wanted. But when I built it for sm 1.3 instead of 2.0, which I had been using before, it ran about 1.4x faster with no code changes!
sm20 -> 35ms
sm13 -> 25ms
After those gorgeous results, I tried checking/unchecking every option in project settings->CUDA settings->all :) I think I found the option responsible for that awesome speedup:
sm13 without “fast math” code generation (fm below) = 25ms
sm13 with fm = 25ms
sm20 without fm = 35ms
sm20 with fm = 25ms (the same result as sm13)
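For anyone who wants to reproduce the four measurements, here is a minimal sketch of a timing harness. It is not my real code: the kernel `mathKernel`, the file name `bench.cu` and the problem size are placeholders, and I am assuming the "fast math" project option maps to nvcc's `-use_fast_math` flag. The last build line is only something worth trying: as far as I understand, for sm_20 nvcc defaults to `-ftz=false -prec-div=true -prec-sqrt=true`, while sm_1x has no IEEE-compliant single-precision division/sqrt and flushes denormals to zero.

```cuda
// Build variants (bench.cu is a placeholder file name):
//   nvcc -arch=sm_13 bench.cu                    // sm13 without fm
//   nvcc -arch=sm_13 -use_fast_math bench.cu     // sm13 with fm
//   nvcc -arch=sm_20 bench.cu                    // sm20 without fm
//   nvcc -arch=sm_20 -use_fast_math bench.cu     // sm20 with fm
//   nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false bench.cu
//                                                // sm20 with sm_1x-like float math
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: division, sqrtf and sinf are the kinds of operations
// whose generated code depends on the fast-math related options.
__global__ void mathKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = sinf(x) / sqrtf(x + 1.0f);
    }
}

int main()
{
    const int n = 1 << 20;       // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const int repeats = 100;

    float* in = 0;
    float* out = 0;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));   // all inputs are 0.0f

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not part of the timing.
    mathKernel<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int r = 0; r < repeats; ++r)
        mathKernel<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("%.3f ms per launch\n", ms / repeats);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If the last sm_20 build lands at the sm13 timing, the difference would come from the precise division/sqrt and denormal handling rather than from the sm level itself.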
Why is this so? Maybe sm13 forces the use of hardware (fast) math, while sm20 does not? Or is it just a coincidence, and code built for a later sm level simply performs worse than the same code built for a lower sm level?