I’ve just got a GTX 480 that I’m comparing to my GTX 285. If I compile without “arch=sm_20”, I get almost a 2x speedup with the 480 compared to the 285. If I compile with “arch=sm_20”, however, the 480 is slower than my 285! I know that pointers become 64-bit with “arch=sm_20” and that register usage therefore increases, but even for a kernel with the same register count and occupancy the code becomes four times slower with “arch=sm_20”. Can anyone explain this?
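For reference, the two builds being compared would look something like this (kernel.cu is a placeholder filename, not from the original post):

```shell
# Default build: no explicit arch, so older-architecture code is
# JIT-compiled for the GTX 480 when the program loads
nvcc -O2 kernel.cu -o app_default

# Fermi-native build: generates sm_20 code directly
nvcc -O2 -arch=sm_20 kernel.cu -o app_sm20
```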
In some other threads it has been stated that arch=sm_20 enables more (fully?) IEEE-compliant floating-point arithmetic, which is slower than the previous floating-point arithmetic. This can be disabled with some compiler flags or some such.
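If it helps, these are the nvcc flags usually mentioned for relaxing the sm_20-default floating-point behaviour back towards the older, faster modes (kernel.cu is a placeholder):

```shell
# Relax individual IEEE behaviours: flush denormals to zero,
# use faster non-IEEE-rounded division and square root
nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false kernel.cu -o app

# Or all at once (this also swaps transcendentals for fast intrinsics,
# so it is a bigger hammer than the three flags above)
nvcc -arch=sm_20 -use_fast_math kernel.cu -o app
```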
If you just want the .ptx part, the --ptx option will give you that without the rest. (It also doesn’t produce an output binary, so keep that in mind.)
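A minimal sketch of both invocations, assuming a placeholder kernel.cu:

```shell
# Emit only the intermediate PTX into kernel.ptx; no binary is produced
nvcc -arch=sm_20 --ptx kernel.cu -o kernel.ptx

# Separately, per-kernel register/smem/cmem usage (like the ptxas info
# lines quoted below) comes from passing -v through to ptxas
nvcc -arch=sm_20 --ptxas-options=-v -c kernel.cu
```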
OK, I recompiled with ftz=true but it did not help. It seems that with arch=sm_20 the compiler uses more of the constant memory, while without arch=sm_20 it uses shared memory instead. Why is that? Since shared memory is much faster when there are cache misses in constant memory, that could be one explanation.
ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_20’
ptxas info : Used 34 registers, 92 bytes cmem[0], 8232 bytes cmem[2]
ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_13’
ptxas info : Used 30 registers, 60+16 bytes smem, 8232 bytes cmem[0], 16 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z41Calculate_A_matrix_and_h_vector_2D_valuesPfS_P6float4S1_S1_iiiii’ for ‘sm_10’
ptxas info : Used 30 registers, 60+16 bytes smem, 8232 bytes cmem[0], 16 bytes cmem[1]
Why doesn’t the compiler take advantage of the shared memory for 2.0?
I think I’ve read somewhere that on Fermi all memory transactions are always 128 bytes; is that correct?
In one of my kernels I have a lot of 32-byte reads. If on Fermi they become 128-byte reads instead, I guess it will take about four times as long to read 128 bytes as 32 bytes from global memory. So if the kernel is bound by memory bandwidth, that could explain why it becomes slower.
If this is true, how come the kernel only becomes slower on a Fermi card when I use “arch=sm_20”, but not otherwise?
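A back-of-envelope check of that 4x figure, under the assumption stated above (every 32-byte request is serviced by a full 128-byte transaction):

```shell
# Assumed transaction size vs. bytes the kernel actually consumes per read
transaction_bytes=128
useful_bytes=32

# Slowdown factor if bandwidth-bound: bytes moved / bytes used
echo $((transaction_bytes / useful_bytes))   # prints 4
```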