I am running a simple 3D 4th-order isotropic stencil code, and I noticed (from the ptxas info output) some differences in resource usage, and in performance, when I compile for sm_13 as opposed to sm_20. For sm_13, I see an increased amount of static shared memory:
47 registers used, 144+16 bytes smem, 36 bytes of cmem[1]
Whereas, when I compile the same code for sm_20, none of the data is placed in smem:
54 registers used, 168 bytes cmem[0], 12 bytes of cmem[16]
In both cases the total amount of data allocated in smem and cmem combined is the same; only the portions allocated to each differ.
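For reference, this is roughly how I collect the ptxas reports; the trivial kernel below is only a placeholder for my actual stencil code:

    // stencil.cu - placeholder stand-in for the actual 3D 4th-order stencil kernel
    __global__ void stencil_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   // the real kernel applies the stencil here
    }

    // ptxas prints the per-architecture resource usage when -Xptxas -v is passed:
    //   nvcc -arch=sm_13 -Xptxas -v -c stencil.cu
    //   nvcc -arch=sm_20 -Xptxas -v -c stencil.cu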
No, the compiler does not “move” any data. For devices of compute capability < 2.0, some of the shared memory is used to store the thread and block indices. Because of this, it is not possible to use the full 16 KB of shared memory on these devices, only 16 KB minus a few bytes. For devices of compute capability >= 2.0, the indices are instead stored in constant memory, which makes it possible to use the full 48 KB of shared memory.
Thank you for making it clear. In my case the device with compute capability 1.3 outperforms the 2.0 device, and I noticed that occupancy is higher (for sm_13 it is around 50%, whereas for sm_20 it is 33%).
Do you use the same kernel configuration for the two cards? Cards with compute capability 2.0 normally require more threads per block for optimal performance.
If your code is memory bound, the number of thread blocks you are launching may not be sufficient for optimal performance on Fermi GPUs. In general, for Fermi, I would recommend that the total number of thread blocks in the grid be >= 20 times the number of concurrently runnable thread blocks to achieve optimal performance on strictly memory-bound code. The more compute-bound the code is, the smaller this “oversubscription” factor can be.
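As a rough sizing sketch (the resident-blocks-per-SM figure here is an assumption; the real value depends on your kernel’s register and shared memory usage):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int numSMs = prop.multiProcessorCount;            // 14-16 on typical Fermi parts
        int residentBlocksPerSM = 6;                       // assumed occupancy limit per SM
        int concurrentBlocks = numSMs * residentBlocksPerSM;
        int minBlocksInGrid  = 20 * concurrentBlocks;      // e.g. 14 * 6 * 20 = 1680 blocks
        printf("aim for at least %d thread blocks in the grid\n", minBlocksInGrid);
        return 0;
    }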
As for the performance differences relative to sm_1x: for sm_2x and sm_3x the compiler uses IEEE-rounded single-precision division, reciprocal, and square root by default, and also turns on denormal support. To approximate the behavior of sm_1x compilations, use -ftz=true -prec-div=false -prec-sqrt=false, which will frequently lower register use and instruction count, resulting in increased performance.

When compiling for the sm_1x target, the compiler knows that no such devices exist with more than 4 GB of memory, so on 64-bit platforms (which imply 64-bit pointers, as CUDA maintains type-size compatibility between host and device code) it can optimize many pointer operations into 32-bit operations. These optimizations have to be largely inhibited for sm_20 and higher targets, because GPUs with more than 4 GB exist for those. Code that uses a lot of pointers may therefore see a significant jump in register use when moving from sm_1x to sm_2x and beyond.
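Concretely, the -ftz / -prec-div / -prec-sqrt settings mentioned above would be added to the build line like this (stencil.cu is just a placeholder file name):

    nvcc -arch=sm_20 -Xptxas -v stencil.cu
    nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false -Xptxas -v stencil.cu

Comparing the ptxas output of the two builds should show the reduced register use.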
Note that since CUDA 4.0, sm_1x code is compiled with the Open64 compiler front end, while sm_2x / sm_3x code is compiled with the NVVM compiler front end (which is based on LLVM technology). Previously, code for all platforms was handled by Open64. There were some inadvertent performance regressions when switching to NVVM. Most of these were addressed in CUDA 4.2, and the balance of known issues should be addressed in CUDA 5.0 (currently in preview for registered developers). I would suggest that customers noticing significant performance regressions due to the switch from Open64 to NVVM file bugs. Please attach a self-contained repro program. There is a link to the bug reporting form on the registered developer website.
Thank you very much, I definitely see an improvement in performance when I use the flags you mentioned. I understand that in CUDA 4.1/4.2 there is an -open64 flag to tell the compiler to use the Open64 front end, but I don’t see an improvement in performance when I just use -open64 on a compute capability 2.0 device. Is that expected?
Fair warning: The -open64 flag is undocumented and unsupported, and it may disappear at any time. Certainly I would strongly recommend against its use in production builds.
The main purpose of the -open64 flag was to allow a quick check of whether any regressions might be due to the switch from Open64 to NVVM. In general, replacing Open64 with NVVM for sm_20 and higher was designed to improve performance (the biggest improvement I saw on any of the codes I looked at was 2x; admittedly a singular case), so it is not surprising that you don’t see better performance when using the -open64 flag.
Compilers are complex pieces of software, and while changing a major component may help 99% of codes in the real world, there is going to be that 1% of codes that takes a hit for any number of reasons and needs additional compiler work. With CUDA 5.0, this transition away from Open64 should be drawing to a close. That said, customers noticing significant performance regressions due to the switch from Open64 to NVVM should file bugs, the sooner the better.