sm level: 1.3 vs 2.0 performance (the older one wins O_o)

My code doesn't depend on the sm level; I can build it for sm10 if I want. But when I built it for 1.3 instead of 2.0, as I had before, I got 1.25x the performance with no code changes!
sm20 → 35ms
sm13 → 25ms
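
(For context: per-kernel times like these are typically measured with CUDA events. Below is a minimal sketch of such a measurement, not the actual code from this thread; the kernel body is just placeholder work.)

    // Minimal CUDA event timing sketch (hypothetical kernel, not the OP's code)
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void kernel(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = sqrtf(in[i]) / (in[i] + 1.0f);  // placeholder work
    }

    int main() {
        const int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        kernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in ms
        printf("kernel time: %.3f ms\n", ms);
        return 0;
    }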

After those gorgeous results, I tried toggling every option under Project Settings → CUDA settings :) I think I found the setting that produces that awesome speed:
sm13 with "no fast math generation" (fm = fast math from here on) → 25ms
sm13 with fm → 25ms
sm20 without fm → 35ms
sm20 with fm → 25ms (the same result as sm13)

Why is this? Maybe sm13 forces the use of hardware math, but sm20 doesn't? Or is it just a coincidence, and the newer sm level simply performs worse when running code written for a lower sm level?


The IDE is Visual Studio. I tested both (sm13 and sm20) with the debug build.

As the documentation points out, for single-precision computation, code generated using the compiler flags -ftz=true -prec-div=false -prec-sqrt=false for targets >= sm_20 most closely approximates the code generated for sm_1x.

The sm_1x architecture had a number of restrictions that were removed in sm_20 and later platforms. As a consequence, the defaults for single-precision computation were more closely aligned with what programmers are used to from host (CPU) computation, which created a “saner” numerical environment: support for denormals, division and square root rounded according to IEEE-754 round-to-nearest-even mode.
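
To make the denormal point concrete, here is a small hypothetical demo (not code from this thread): compiled for sm_20 with the default -ftz=false, the denormal result survives, while -ftz=true (or compiling for sm_13, where single-precision denormals are always flushed) turns it into zero.

    // Hypothetical demo: denormal handling under -ftz=true vs -ftz=false
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void denormal_demo(float *out) {
        volatile float tiny = 1.0e-30f;  // a normal float (volatile blocks folding)
        out[0] = tiny * 1.0e-10f;        // ~1e-40: a denormal result
    }

    int main() {
        float *d_out, h_out;
        cudaMalloc(&d_out, sizeof(float));
        denormal_demo<<<1, 1>>>(d_out);
        cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("%g\n", h_out);  // 0 with -ftz=true; ~1e-40 with -ftz=false on sm_20
        return 0;
    }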

However, these niceties come at some modest cost in performance ("no free lunch"), which is why we also added compiler switches to turn them off for those programmers who care about speed first and foremost. The compiler switch -use_fast_math implies -ftz=true -prec-div=false -prec-sqrt=false.
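
Expressed as nvcc command lines (kernel.cu is a placeholder file name):

    # sm_1x-style single-precision code generation on an sm_20 target:
    nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false kernel.cu
    # shorthand that implies the three flags above:
    nvcc -arch=sm_20 -use_fast_math kernel.cu
    # the older target, for comparison:
    nvcc -arch=sm_13 kernel.cu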

I would suggest reading the Programming Guide and the Best Practices Guide; they address this and other issues.

Visual Studio sets the -G flag on by default, so if you want better performance make sure it is off.
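
For illustration (hypothetical command lines):

    # debug configuration (what Visual Studio passes by default):
    # -G disables device code optimization to enable device-side debugging
    nvcc -G -arch=sm_20 kernel.cu
    # release-style build: device code optimization on (the default without -G)
    nvcc -arch=sm_20 kernel.cu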

Thanks a lot, it works well now :) (though for some reason sm13 with fast math and -ftz=true -prec-div=false -prec-sqrt=false is still about 10ms faster than sm20: average time 335ms for sm13 vs 345ms for sm20. I guess it's the precision trade-off, so it's not important.)
P.S. I will read the Guide, but later. It's a challenge for me to read 200 pages in English.

Yes, I know. But I can't sleep without knowing why there was a performance gap between 1.3 and 2.0 :)

Maybe it's something simple. The older architecture used fewer registers, which for some programs leads to higher occupancy. For a fair comparison I suggest finding the optimal number of threads per block for each arch. Also, for sm 20 and higher, using __launch_bounds__ can produce higher occupancy at the cost of some register spilling; see the sketch below.
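
A minimal sketch of the __launch_bounds__ suggestion; the kernel and the numbers are made up for illustration:

    // Hypothetical sketch: __launch_bounds__ tells the compiler the intended
    // launch configuration so it can limit registers per thread; values that
    // no longer fit in registers may spill to local memory.
    __global__ void
    __launch_bounds__(256, 4)  // <= 256 threads/block, aim for >= 4 blocks/SM
    scale_kernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;  // placeholder work
    }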

Have you run the Visual Profiler tool that comes with the CUDA toolkit? Or perhaps the performance analyzer that comes with Nsight? I think it would give you the best idea of what the problem is.

My guess is that the sm 2.0 version was not fully optimized.