I know this might be rather broad without the context of my application, but is there any reason why I am seeing slower run times when I compile my code with -arch sm_20 as opposed to -arch sm_13?
The app runs on a Tesla C2050 with CUDA 3.2 on a Linux platform. If I compile with -arch sm_13, the average runtime is 38.6 +/- .01 seconds. If I make absolutely no changes to the code and just change the makefile so it compiles with -arch sm_20, runtimes drop to 34.6 +/- .01 seconds. I would expect the runtimes to at least stay the same. The thread block size is the only CUDA-specific parameter and was optimized for a Tesla C1060, but even if I change the thread block size up or down for the C2050 and recompile with sm_20, the best runtime is the 34.6-second time.
Yes, a number of reasons can make code compiled for sm_20 run a bit slower: higher register pressure, etc. Do you need 64-bit in the CUDA code? Try compiling with -m32, if that works on Linux. Also check the flush-to-zero (FTZ) mode and flag. My application also runs a bit slower with 2.0, by a few percent. I find that acceptable, since I can compile for 1.2 to target all GPUs at once. I don't mind giving up a little performance on GF100, as I do not need its new features right now.
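For example, a 32-bit build might look something like this (app.cu and the output name are just placeholders):

nvcc -m32 -arch=sm_20 -o app app.cu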
OK, thanks. I did make a mistake in the original post where I switched the running times between sm_13 and sm_20; it slows down when using sm_20. I suppose precision could be an issue, though I know the majority of operations are integer operations, not floating-point. I do not need 64-bit, so I'll give 32-bit a try and see if that makes a difference. It's not really a huge issue, because I can keep compiling with sm_13 for the best performance; I was just trying to track down the exact reasons for such a slowdown, when I had expected runtimes to be at least the same, if not faster.
To expand on what tmurray said, try adding the following to your sm_20 build: -ftz=true -prec-sqrt=false -prec-div=false. This will configure the single-precision arithmetic for sm_20 as closely as possible to the sm_1x configuration. See also section 5.4.1 and appendix G of the Programming Guide.
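Concretely, a build along these lines should do it (the file names are placeholders):

nvcc -arch=sm_20 -ftz=true -prec-sqrt=false -prec-div=false -o app app.cu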
I am not sure what the question is. Below is a table that shows how various source-level operations of interest map to PTX instructions for sm_2x targets, depending on the three flags -ftz, -prec-sqrt, and -prec-div. The compiler default for an sm_2x target is denormal support and single-precision reciprocal, division, and square root with IEEE-compliant rounding (i.e. -ftz=false -prec-sqrt=true -prec-div=true). The description of each PTX instruction mentioned in the table below can be found in the PTX specification document.
The information in the above table can easily be verified by compiling the code below with nvcc --keep -arch=sm_20 [-ftz={true|false} -prec-div={true|false} -prec-sqrt={true|false}] and inspecting the generated .ptx file.
__global__ void kernel (float *res, float a, float b, float c, float d)
{
    res[0] = 1.0f / a;   // single-precision reciprocal
    res[1] = b / c;      // single-precision division
    res[2] = sqrtf (d);  // single-precision square root
}

int main (void)
{
    return 0;
}
Well, I suppose this isn't precision related, then. Here are the results I get when compiling with different arguments. The first column is the problem size, the second is the running time (including data transfers), and the third is an application-specific performance measurement (higher = better).
When compiled with -arch sm_20 -ftz=true -prec-sqrt=false -prec-div=false:
Any other thoughts? We make heavy use of registers, which limits our thread block size to 256, although in theory we should see performance gains from larger thread blocks. The only way we can use 512-thread blocks (or anything much above 256) is by compiling with sm_20, but this performance decrease outweighs any gains we get from the larger blocks.
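For what it's worth, one knob I know of for trading registers against block size is __launch_bounds__; a minimal sketch (the kernel name, bounds, and body are made up, and whether the resulting register spilling is a net win would have to be measured):

__global__ void __launch_bounds__(512, 2)  // cap register use so 512-thread blocks fit, at least 2 blocks per SM
myKernel (float *data)
{
    // kernel body; the compiler may spill registers to local memory to meet the bound
}

The nvcc flag -maxrregcount=N is the per-compilation-unit equivalent of the same idea.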
As I told you, -ftz=true -prec-sqrt=false -prec-div=false is the default for sm_20, at least for some SDK, OS, and compiler combinations.
Try -ftz=false -prec-sqrt=true -prec-div=true and you may see a difference.
Actually, your results confirm that -ftz=true -prec-sqrt=false -prec-div=false is the default for sm_20, since there is no difference in the results, assuming your program is not totally memory bound.
@dhains: Sorry, I cannot think of anything else that would explain the performance difference you are seeing.
@Lev: I just tried with CUDA 3.2 on three different systems (WinXP32, WinXP64, RHEL Linux64) and the results are consistent. I built the code I posted with
nvcc --keep -arch=sm_20 kernel.cu
The relevant part of the generated kernel.ptx file (here: from the WinXP64 system) shows operations with the .rn suffix and without the .ftz suffix, i.e. with IEEE rounding and with denormal support.
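For illustration, the lines in question look roughly like this (a reconstruction, not verbatim output; the register numbers are invented):

    rcp.rn.f32  %f5, %f1;       // res[0] = 1.0f / a
    div.rn.f32  %f6, %f2, %f3;  // res[1] = b / c
    sqrt.rn.f32 %f7, %f4;       // res[2] = sqrtf(d)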
I tested it with version 3.0 on Win7 64. Maybe it is a compiler configuration issue, maybe something else; somebody else needs to check. I got slower results on my system when I set -ftz=false -prec-sqrt=true -prec-div=true myself.
Sorry, I don't have an explanation. The compiler defaults for these flags have not changed since CUDA 3.0. I did a quick test with CUDA 3.0 on RHEL Linux64 and WinXP64 (using the identical command line: nvcc --keep -arch=sm_20 kernel.cu), and the PTX output matches what I posted previously for the CUDA 3.2 toolchain.
So I don't understand why the throughputs are higher on sm_20 when the number of cache hits/misses and global read (load) requests are identical between sm_13 and sm_20. I'm also not sure why more instructions are issued in the sm_20 version. The combination of these things, though, is seemingly the cause of the performance loss I'm seeing when I compile with sm_20.
If you are a registered developer, I would suggest filing a compiler bug for this. Please attach self-contained code that allows the issue to be reproduced. Thanks!
OK, I found out what happened. It is the compiler switch -use_fast_math that changes the default value of the flush-to-zero mode, and this was not documented. So maybe the thread author used fast math.
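In other words, a build such as

nvcc -arch=sm_20 -use_fast_math kernel.cu

appears to behave as if -ftz=true were passed (and, if I remember correctly, -prec-div=false and -prec-sqrt=false as well, in addition to mapping calls like sqrtf() to the fast intrinsics).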