Why is my code slower with flags "-arch;sm_20"?

Hello,

I am using CMake to configure my project. I am working on a video processing application with a Quadro 600 card (compute capability 2.1).

If I set NVCC_FLAGS to “-arch;sm_20”, my application actually runs much slower than with the default settings. I guess the default is “sm_10” or “sm_11”?

What causes this behavior?

Thanks

Yes, the default is sm_10. There are multiple potential reasons why the code may run slower. Are you on a 64-bit system by any chance? That is where such slowdowns are seen most frequently, caused by higher register usage which reduces occupancy and may also introduce spilling. In the following forum thread, post #7, I listed some of the reasons for higher register pressure when compiling for sm_2x:

http://forums.nvidia.com/index.php?showtopic=205120

Is your code using single-precision division, reciprocals, or square roots frequently? If so, try passing the compiler flags -ftz=true -prec-sqrt=false -prec-div=false when compiling for sm_2x, as this will generate code that is more similar to the code generated for sm_1x.

Yes, I am using a 64-bit system. I also learned about the precision issue that you mentioned in one of the posts. I tried setting the flags -ftz=true -prec-sqrt=false -prec-div=false. I tried many combinations and, interestingly, all of them gave a very good and equal speedup. For instance, the program runs at almost the same speed (fps) with either -ftz=true or -ftz=false…

I double-checked the solution: when -ftz is set to false, nvcc actually automatically switches it back to true to match architectural capabilities. I think this explains why every configuration has the same speedup: nvcc ends up setting them all the same way.

When you specify -ftz=false, the compiler changes it back to -ftz=true only when the target architecture is sm_1x, as that architecture does not support single-precision denormals (denormals are always supported for double-precision computation, for both sm_1x and sm_2x).

Is there a reason to use sm_20 if the kernel does not use any features of compute capability 2.0?

I thought I read that if possible the lowest architecture for a kernel should be used, similar to shading languages.

Well, shader target switches are not recommended in graphics APIs.

On my kernels I’ve noticed that when you use sm_20, all constants get read via global memory. You should check the assembly generated for your kernel and compare.

Weird. I am using a Quadro 600, which is 2.1. My understanding is that if I don’t specify anything, it will run as double-precision, which is slow. If I specify -ftz=true, then it runs as single-precision, which is right. But why does the compiler switch to true if I specify -ftz=false, even if I set -arch to sm_21?

Another question: I noticed that when I set -ftz to true, the printf() function is not supported in the kernel. I thought that even though this setting comes closest to a 1.x device, it would still support the printf() function…

The -ftz flag affects only single-precision computation. It controls what should happen when operands of very small magnitude, |x| < 2^-126, are encountered. These are the so-called subnormal (denormal) numbers. With -ftz=true, these numbers are treated as zero (ftz = flush-to-zero); with -ftz=false, they are kept as denormal numbers. The default compiler setting for sm_2x is -ftz=false.

The -ftz flag should have no interaction with the device-side printf(). If you have a small self-contained app that demonstrates that printf() behaves differently based on -ftz={true|false}, please post it.