-arch sm_13 vs -arch sm_20 (sm_20 slower on C2050)

I know this might be rather broad without the context of my application, but is there any reason why I am seeing slower run times when I compile my code with -arch sm_20 as opposed to -arch sm_13?

The app is running on a Tesla C2050 with CUDA 3.2 on a Linux platform. If I compile with -arch sm_13, the average runtime is 38.6 +/- .01 seconds. If I make absolutely no changes to the code and just change the makefile so it compiles with -arch sm_20, runtimes drop to 34.6 +/- .01 seconds. I would expect the runtimes to at least stay the same. The thread block size is the only CUDA-specific parameter and was optimized for a Tesla C1060, but even if I change the thread block size up or down for the C2050 and recompile with sm_20, the best runtime is the 34.6-second one.

Sorry, but what’s your point? It still runs faster when compiled with -arch sm_20.

Was it an fps-like value?

Yes, a number of reasons can make code built for sm_20 run a bit slower: higher register pressure, etc. Do you need 64-bit addressing in the CUDA code? Try compiling with -m32, if that works on Linux. Also check the ftz mode and flag. My application also runs a bit slower with 2.0, by a few percent. I find that acceptable, since it means I can compile for 1.2 and target all GPUs at once without regretting the lost performance on GF100; I do not need its new features right now.
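
For example, a 32-bit build might look something like this (app.cu is just a placeholder file name, and the 32-bit CUDA libraries need to be installed on a 64-bit system):

nvcc -m32 -arch=sm_20 -o app app.cu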

sm_20 has tighter IEEE precision requirements (in terms of rounding modes and the like), which can impact performance versus sm_13.

OK, thanks. I did make a mistake in the original post where I switched the running times for sm_13 and sm_20; it actually slows down when using sm_20. I suppose the precision could be an issue, though I do know the majority of operations are integer operations rather than floating-point operations. I do not need 64 bit, so I’ll give 32 bit a try and see if that makes a difference. It’s not really a huge issue, because I can keep compiling with sm_13 for the best performance; I was just trying to track down the exact reasons for such a slowdown when I had expected things to at least stay the same, if not speed up.

To expand on what tmurray said, try adding the following to your sm_20 build: -ftz=true -prec-sqrt=false -prec-div=false. This will configure the single-precision arithmetic for sm_20 as closely as possible to the sm_1x configuration. See also section 5.4.1 and appendix G of the Programming Guide.
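
For example, with a plain nvcc build that would look something like the following (app.cu is just a placeholder here; in practice the flags would go on your existing makefile line):

nvcc -arch=sm_20 -ftz=true -prec-sqrt=false -prec-div=false -o app app.cu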

These are the default options for 2.0.

The compiler defaults for sm_2x are: -ftz=false -prec-sqrt=true -prec-div=true, i.e. the inverse of the settings I suggested to try.

How would you explain this PTX output, then?

mul.ftz.f32
rcp.approx.ftz
div.approx.ftz.f32
sqrt.approx.ftz

and

mul.f32
sqrt.rn.f32

I am not sure what the question is. Below is a table that shows how various source-level operations of interest are mapped to PTX instructions for sm_2x targets, based on the three flags -ftz, -prec-sqrt, -prec-div. The compiler default for an sm_2x target is denormal support and single-precision reciprocal, division, and square root with IEEE compliant rounding (i.e. -ftz=false -prec-sqrt=true -prec-div=true). The description of each PTX instruction mentioned in the table below can be found in the PTX specification document.

                ===================================================================================
                                                     -ftz=false
                =========================================+=========================================
                             -prec-div=false             |             -prec-div=true
                ====================+====================+=====================+===================
                -prec-sqrt=false    | -prec-sqrt=true    | -prec-sqrt=false    | -prec-sqrt=true
                ====================+====================+=====================+===================
1.0f/x          rcp.approx.f32      | rcp.approx.f32     | rcp.rn.f32          | rcp.rn.f32
__fdividef(x,y) div.approx.f32      | div.approx.f32     | div.approx.f32      | div.approx.f32
x/y             div.full.f32        | div.full.f32       | div.rn.f32          | div.rn.f32
sqrtf(x)        sqrt.approx.f32     | sqrt.rn.f32        | sqrt.approx.f32     | sqrt.rn.f32
                ===================================================================================
                                                     -ftz=true
                =========================================+=========================================
                             -prec-div=false             |             -prec-div=true
                ====================+====================+=====================+===================
                -prec-sqrt=false    | -prec-sqrt=true    | -prec-sqrt=false    | -prec-sqrt=true
                ====================+====================+=====================+===================
1.0f/x          rcp.approx.ftz.f32  | rcp.approx.ftz.f32 | rcp.rn.ftz.f32      | rcp.rn.ftz.f32
__fdividef(x,y) div.approx.ftz.f32  | div.approx.ftz.f32 | div.approx.ftz.f32  | div.approx.ftz.f32
x/y             div.full.ftz.f32    | div.full.ftz.f32   | div.rn.ftz.f32      | div.rn.ftz.f32
sqrtf(x)        sqrt.approx.ftz.f32 | sqrt.rn.ftz.f32    | sqrt.approx.ftz.f32 | sqrt.rn.ftz.f32
                ===================================================================================

The information in the above table can easily be verified by compiling the code below with nvcc --keep -arch=sm_20 [-ftz={true|false} -prec-div={true|false} -prec-sqrt={true|false}] and inspecting the generated .ptx file.

__global__ void kernel (float *res, float a, float b, float c, float d)
{
    res[0] = 1.0f / a;
    res[1] = b / c;
    res[2] = sqrtf (d);
}

int main (void) {
    return 0;
}

Well, I suppose this isn’t precision related. Here are the results I get when compiling with different arguments. The first column is the problem size, the second is the running time (including data transfers), and the third is an application-specific performance measurement (higher = better).

When compiled with -arch sm_20 -ftz=true -prec-sqrt=false -prec-div=false:

144 1.14882 23.1487
189 1.48418 23.5175
222 1.70787 24.0057
375 2.76831 25.0168
406 2.9902 25.075
567 4.11417 25.4517
657 4.79557 25.3012
729 5.29158 25.4424
1000 7.13668 25.8773
1500 10.8169 25.6098
2005 14.325 25.8484
2504 17.955 25.7552
3005 21.4604 25.8596
3564 25.5347 25.7764
4061 28.9836 25.8759
4548 32.5644 25.7925
4743 33.8949 25.8425
5147 36.9179 25.7474
5478 39.1978 25.8093

Here’s the results with just -arch sm_20:

144 1.14943 23.1364
189 1.48521 23.5011
222 1.70765 24.0087
375 2.76974 25.0039
406 2.99631 25.0239
567 4.11815 25.4271
657 4.80205 25.267
729 5.29878 25.4078
1000 7.14601 25.8436
1500 10.8305 25.5774
2005 14.3447 25.813
2504 17.9767 25.7241
3005 21.488 25.8264
3564 25.5671 25.7437
4061 29.0229 25.8409
4548 32.6015 25.7631
4743 33.9418 25.8068
5147 36.9685 25.7121
5478 39.2545 25.772

And the results with just -arch sm_13:

144 1.02598 25.9204
189 1.32068 26.4289
222 1.51603 27.0434
375 2.45256 28.2375
406 2.64965 28.2978
567 3.66202 28.5942
657 4.24658 28.5721
729 4.68574 28.732
1000 6.31158 29.2602
1500 9.57007 28.9462
2005 12.6652 29.236
2504 15.8794 29.1216
3005 18.9692 29.2558
3564 22.5808 29.1484
4061 25.6204 29.2727
4548 28.7977 29.1661
4743 29.9685 29.2284
5147 32.6465 29.1161
5478 34.6674 29.1821

Any other thoughts? We make heavy use of registers, which limits our thread block size to 256, although theoretically we should see performance increases with larger thread blocks. The only way we can use blocks of 512 threads (or anything much larger than 256) is to compile with sm_20, but this performance decrease outweighs any gains we get from the larger blocks.
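
As an aside (not something tried in this thread): one way to experiment with larger blocks despite the register pressure is to cap registers per thread, either with the -maxrregcount nvcc flag or with the __launch_bounds__ qualifier. A minimal sketch with a hypothetical kernel, just to show the syntax:

// Sketch only: cap the kernel at 512 threads per block so the compiler
// limits register usage accordingly. It may spill to local memory, which
// has its own cost, so this is worth profiling rather than assuming a win.
__global__ void __launch_bounds__(512) scale_kernel (float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;   // placeholder work
    }
}

// Launch example: scale_kernel<<<(n + 511) / 512, 512>>>(d_data, n);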

I told you, -ftz=true -prec-sqrt=false -prec-div=false is the default for sm_20, at least for some SDK, OS, and compiler combinations.

Try with -ftz=false -prec-sqrt=true -prec-div=true and you may see a difference.

Actually, your results confirm that -ftz=true -prec-sqrt=false -prec-div=false is the default for sm_20, because there is no difference in the results, assuming your program is not totally memory bound.

I got this output

mul.ftz.f32
rcp.approx.ftz
div.approx.ftz.f32
sqrt.approx.ftz

with just sm_20, no other flags set.

So, on my system the default values for sm_20 are -ftz=true, -prec-sqrt=false, -prec-div=false.

@dhains: Sorry, I cannot think of anything else that would explain the performance difference you are seeing.

@Lev: I just tried with CUDA 3.2 on three different systems (WinXP32, WinXP64, RHEL Linux64) and the results are consistent. I built the code I posted with

nvcc --keep -arch=sm_20 kernel.cu

The relevant part of the generated kernel.ptx file (here: from the WinXP64 system) shows operations with the .rn suffix and without the .ftz suffix, i.e. with IEEE rounding and with denormal support.

.entry _Z6kernelPfffff (
                .param .u64 __cudaparm__Z6kernelPfffff_res,
                .param .f32 __cudaparm__Z6kernelPfffff_a,
                .param .f32 __cudaparm__Z6kernelPfffff_b,
                .param .f32 __cudaparm__Z6kernelPfffff_c,
                .param .f32 __cudaparm__Z6kernelPfffff_d)
        {
        .reg .u64 %rd<3>;
        .reg .f32 %f<9>;
        .loc    29      1       0
$LDWbegin__Z6kernelPfffff:
        .loc    29      3       0
        ld.param.u64    %rd1, [__cudaparm__Z6kernelPfffff_res];
        ld.param.f32    %f1, [__cudaparm__Z6kernelPfffff_a];
        rcp.rn.f32      %f2, %f1;
        st.global.f32   [%rd1+0], %f2;
        .loc    29      4       0
        ld.param.f32    %f3, [__cudaparm__Z6kernelPfffff_b];
        ld.param.f32    %f4, [__cudaparm__Z6kernelPfffff_c];
        div.rn.f32      %f5, %f3, %f4;
        st.global.f32   [%rd1+4], %f5;
        .loc    29      5       0
        ld.param.f32    %f6, [__cudaparm__Z6kernelPfffff_d];
        sqrt.rn.f32     %f7, %f6;
        st.global.f32   [%rd1+8], %f7;
        .loc    29      6       0
        exit;
$LDWend__Z6kernelPfffff:
        } // _Z6kernelPfffff

I tested it with version 3.0 on Win7 64. Maybe it is a compiler configuration issue, maybe something else; somebody else needs to check. I got slower results on my system when I set the flags -ftz=false -prec-sqrt=true -prec-div=true myself.

It would be interesting to know what dhains’s settings are.

Sorry, I don’t have an explanation. The compiler defaults for these flags have not changed since CUDA 3.0. I did a quick test with CUDA 3.0 on RHEL Linux64 and WinXP64 (using an identical command line: nvcc --keep -arch=sm_20 kernel.cu) and the PTX output matches what I posted previously for the CUDA 3.2 toolchain.

I suppose dhains could also have the same situation as me, since he gets the same results with the default options and with the speed options.

Well, after running computeprof on my kernel, the biggest differences between sm_13 and sm_20 are the following:

arch  | instructions issued | active warps | active cycles
------|---------------------|--------------|--------------
sm_13 |         3.33861e+08 |  3.06033e+09 |   2.57639e+08
sm_20 |         3.93182e+08 |  3.42983e+09 |   2.94618e+08

The other significant difference is in the summary table and the global memory throughputs:

arch  | glob mem read throughput | glob mem write throughput | glob mem overall throughput
------|--------------------------|---------------------------|----------------------------
sm_13 |                  28.5311 |                   18.8527 |                     47.3838
sm_20 |                  24.6685 |                   16.3238 |                     40.9923

So I don’t understand why the throughputs are lower with sm_20 when the number of cache hits/misses and global load/store requests are identical between sm_13 and sm_20. I’m also not sure why there are more instructions issued in the sm_20 version. The combination of these things, though, seems to be the cause of the performance loss I’m seeing when I compile with sm_20.

If you are a registered developer, I would suggest filing a compiler bug for this. Please attach self-contained code that allows the issue to be reproduced. Thanks!

OK, I found out what happened. It is the compiler switch -use_fast_math that changed the flush-to-zero mode (and the division and square-root precision) from the defaults, and this was not documented. So maybe the thread author used fast math.
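
For reference, with the test kernel posted earlier this is easy to check by comparing the PTX from the two builds:

nvcc --keep -arch=sm_20 kernel.cu
nvcc --keep -arch=sm_20 -use_fast_math kernel.cu

The first should show rcp.rn.f32 / div.rn.f32 / sqrt.rn.f32 in kernel.ptx (as posted above), the second the .approx/.ftz variants.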