Code 4 times slower with "arch=sm_20"

I’ve just got a GTX 480 card that I’m comparing to my GTX 285. If I compile without “arch=sm_20” I get almost a 2x speedup with the 480 compared to the 285. If I compile with “arch=sm_20”, however, the 480 is slower than my 285! I know that the pointers become 64-bit with “arch=sm_20” and that the number of registers thereby increases, but even for a kernel with the same number of registers and the same occupancy the code becomes 4 times slower if I use “arch=sm_20”. Can anyone explain this?

What exactly does “arch=sm_20” do?

How many registers are you talking about here?

Do you by any chance make use of doubles in your code?

34 registers, both with and without arch=sm_20

No doubles, float only

It enables the handbrake.

Humor aside, have you tried inspecting the PTX code for clues?

I removed some “__umul24” calls and that reduced the time a bit; otherwise I don’t know what to change. How do I look at the PTX code?
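For reference, the kind of change I made looks roughly like this (a placeholder kernel, not my real one):

__global__ void scale(float *data, float factor, int n)
{
    // On sm_1x, __umul24(blockIdx.x, blockDim.x) was the faster way to form this index.
    // On sm_20 the full 32-bit integer multiply is native, so the plain multiply
    // should be at least as fast (and __umul24 is reportedly emulated there).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = data[idx] * factor;
}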

You need to pass the “-keep” option to nvcc (as part of the CUDA build rules).

In the end there will be .ptx files in the same folder as the .cu files. And lots of other intermediate junk from the build process.

In some other threads, it’s been stated that arch=sm_20 enables more (fully?) IEEE-compliant floating point arithmetic, which is slower than the previous floating point arithmetic. This can be disabled with some compiler flags or some such.

If you just want the .ptx part, the --ptx option will give you that without the rest. (It also doesn’t produce an output binary, so keep that in mind.)
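For example, something like this (kernel.cu is just a placeholder for your own file name):

nvcc -arch=sm_20 --ptx kernel.cu

writes only kernel.ptx, while

nvcc -arch=sm_20 -keep -c kernel.cu

keeps all the intermediate files from the build. Adding --ptxas-options=-v to the compile also makes ptxas print the register and memory usage of each kernel, which is handy for comparisons like this.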

Can you link to any of those? I have searched but not found any…

OK, I recompiled with -ftz=true but it did not help. It seems that with arch=sm_20 the compiler uses more of the constant memory, and without arch=sm_20 it uses shared memory instead; why is that? Since the shared memory is much faster if there are cache misses in constant memory, that could be one explanation.

Example

ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_20’
ptxas info : Used 34 registers, 92 bytes cmem[0], 8232 bytes cmem[2]

ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_13’
ptxas info : Used 30 registers, 60+16 bytes smem, 8232 bytes cmem[0], 16 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_10’
ptxas info : Used 30 registers, 60+16 bytes smem, 8232 bytes cmem[0], 16 bytes cmem[1]

Why doesn’t the compiler take advantage of the shared memory for 2.0?

sm_20 doesn’t pass parameters via smem, for one.

The flags you are looking for are -prec-div=false and -prec-sqrt=false.

(See the Fermi Tuning Guide).
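For example (kernel.cu is just a placeholder name):

nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false -c kernel.cu

If I remember correctly, -use_fast_math turns all of these on at once (plus the fast intrinsics), so it is the quickest way to check whether the more precise arithmetic is what costs you the time.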

I tried those as well, but no difference; I mostly do multiplications anyway. Maybe Fermi does not like the pattern of my memory reads…

How does sm_20 pass parameters to the kernels then?

thanks

eyal

Kernel parameters go via constant memory in Fermi.
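As a rough illustration (a toy kernel, not yours; the exact byte counts depend on the toolkit version):

__global__ void axpy(float a, const float *x, float *y, int n)
{
    // a, x, y and n are the kernel arguments:
    // for sm_1x ptxas counts them as part of the smem figure,
    // for sm_20 they go to constant memory and show up under cmem[0] instead.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

That should also be part of why your sm_13 output shows “60+16 bytes smem” while the sm_20 output shows none: the argument list has moved into cmem[0], and with 64-bit pointers it takes a bit more room there.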

I think I’ve read somewhere that on Fermi all the memory transactions are always 128 bytes; is that correct?

In one of my kernels I have a lot of 32-byte reads. If those become 128-byte transactions on Fermi, I guess it will take about 4 times as long to read 128 bytes as to read 32 bytes from global memory, so if the kernel is bound by memory bandwidth that could explain why it becomes slower.
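Roughly this kind of access pattern, simplified (the real kernel is more complicated):

// Each thread reads 8 consecutive floats (32 bytes), but neighbouring threads read
// from locations far apart, so the reads cannot be combined into larger transactions.
__global__ void gather32(const float *in, float *out, const int *offsets, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        int base = offsets[i];
        for (int k = 0; k < 8; ++k)
            sum += in[base + k];
        out[i] = sum;
    }
}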

If this is true, how come the kernel only becomes slower on a Fermi card if I use “arch=sm_20” but not otherwise?

Is it possible that, without specifying it, you have doubles demoted to floats, and so it is faster?

I don’t use doubles.