Code 4 times slower with "arch=sm_20"

I’ve just got a GTX 480 card that I’m comparing to my GTX 285. If I compile without “arch=sm_20” I get almost a 2x speedup with the 480 compared to the 285. If I compile with “arch=sm_20”, however, the 480 is slower than my 285! I know that the pointers become 64-bit with “arch=sm_20” and that the number of registers thereby increases, but even for a kernel with the same number of registers and the same occupancy the code becomes 4 times slower if I use “arch=sm_20”. Can anyone explain this?

What exactly does “arch=sm_20” do?

How many registers are you talking about here?

Do you by any chance make use of doubles in your code?

34 registers, both with and without arch=sm_20

No doubles, float only

It enables the handbrake.

Humor aside, have you tried inspecting the PTX code for clues?

I removed some “__umul24” calls and that reduced the time a bit; otherwise I don’t know what to change. How do I look at the PTX code?
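For reference, the kind of change I made looks roughly like this (a placeholder kernel, not my real one):

__global__ void scale(float *data, float factor, int n)
{
    // On sm_1x, __umul24(blockIdx.x, blockDim.x) was the faster way to form this index.
    // On sm_20 the full 32-bit integer multiply is native, so the plain multiply
    // should be at least as fast (and __umul24 is reportedly emulated there).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = data[idx] * factor;
}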

You need to pass the “-keep” option to nvcc (as part of the CUDA build rules).

In the end there will be .ptx files in the same folder as the .cu files. And lots of other intermediate junk from the build process.

In some other threads, it’s been stated that arch=sm_20 enables more (fully?) IEEE-compliant floating point arithmetic, which is slower than the previous floating point arithmetic. This can be disabled with some compiler flags or some such.

If you just want the .ptx part, the --ptx option will give you that without the rest. (It also doesn’t produce an output binary, so keep that in mind.)
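For example, something like this (kernel.cu is just a placeholder for your own file name):

nvcc -arch=sm_20 --ptx kernel.cu

writes only kernel.ptx, while

nvcc -arch=sm_20 -keep -c kernel.cu

keeps all the intermediate files from the build. Adding --ptxas-options=-v to the compile also makes ptxas print the register and memory usage of each kernel, which is handy for comparisons like this.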

Can you link to any of those? I have searched but not found any…

OK, I recompiled with -ftz=true but it did not help. It seems that with arch=sm_20 the compiler uses more of the constant memory, and without arch=sm_20 it uses shared memory instead; why is that? Since the shared memory is much faster if there are cache misses in constant memory, that could be one explanation.

Example

ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_20’
ptxas info : Used 34 registers, 92 bytes cmem[0], 8232 bytes cmem[2]

ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_13’
ptxas info : Used 30 registers, 60+16 bytes smem, 8232 bytes cmem[0], 16 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_10’
ptxas info : Used 30 registers, 60+16 bytes smem, 8232 bytes cmem[0], 16 bytes cmem[1]

Why doesn’t the compiler take advantage of the shared memory for 2.0?

sm_20 doesn’t pass parameters via smem, for one.

The flags you are looking for are -prec-div=false and -prec-sqrt=false.

(See the Fermi Tuning Guide).
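For example (kernel.cu is just a placeholder name):

nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false -c kernel.cu

If I remember correctly, -use_fast_math turns all of these on at once (plus the fast intrinsics), so it is the quickest way to check whether the more precise arithmetic is what costs you the time.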

I tried those as well, but no difference; I mostly do multiplications anyway. Maybe Fermi does not like the pattern of my memory reads…

How does sm_20 pass parameters to the kernels then?

thanks

eyal

Kernel parameters go via constant memory in Fermi.
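As a rough illustration (a toy kernel, not yours; the exact byte counts depend on the toolkit version):

__global__ void axpy(float a, const float *x, float *y, int n)
{
    // a, x, y and n are the kernel arguments:
    // for sm_1x ptxas counts them as part of the smem figure,
    // for sm_20 they go to constant memory and show up under cmem[0] instead.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

That should also be part of why your sm_13 output shows “60+16 bytes smem” while the sm_20 output shows none: the argument list has moved into cmem[0], and with 64-bit pointers it takes a bit more room there.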

I think I’ve read somewhere that on Fermi all the memory transactions are always 128 bytes; is that correct?

In one of my kernels I have a lot of 32-byte reads. If those become 128-byte transactions on Fermi, I guess it will take about 4 times as long to read 128 bytes as to read 32 bytes from global memory, so if the kernel is bound by memory bandwidth that could explain why it becomes slower.
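Roughly this kind of access pattern, simplified (the real kernel is more complicated):

// Each thread reads 8 consecutive floats (32 bytes), but neighbouring threads read
// from locations far apart, so the reads cannot be combined into larger transactions.
__global__ void gather32(const float *in, float *out, const int *offsets, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        int base = offsets[i];
        for (int k = 0; k < 8; ++k)
            sum += in[base + k];
        out[i] = sum;
    }
}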

If this is true, how come the kernel only becomes slower on a Fermi card if I use “arch=sm_20” but not otherwise?

Is it possible that, without specifying it, you have doubles demoted to floats, and so it is faster?

I don’t use doubles.