Compiling for the right architecture

Hi all!

My question is pretty basic and hopefully simple to answer: how should I choose the correct values to pass to nvcc through the -arch and -code flags?
Or do you know of any good source of information on this topic?

Thanks in advance

Section 3.1 in the CUDA 3.1 programming guide discusses this, and the actual architecture values for common GPUs are listed in table A-1.
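
Roughly speaking (just a sketch, with app.cu as a placeholder file name), -arch selects the virtual architecture the front end generates PTX for, and -code selects the real GPUs that get machine code embedded in the fat binary, for example:

    nvcc -arch=compute_13 -code=sm_13,compute_13 -o app app.cu

That builds native sm_13 machine code and also embeds the compute_13 PTX, so newer GPUs can still JIT-compile it at load time.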

You neglect to mention exactly what it is that you are trying to accomplish.

I build apps for maximum compatibility with both sm_11 and sm_20 targets - see the Fermi compatibility guide for the command line options.
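
For what it's worth, a dual-target build looks roughly like this (a sketch only; the exact options recommended in the Fermi compatibility guide may differ slightly, and app.cu is a placeholder):

    nvcc -gencode arch=compute_11,code=sm_11 \
         -gencode arch=compute_20,code=sm_20 \
         -o app app.cu

The first -gencode pair produces sm_11 machine code for pre-Fermi cards; the second produces native sm_20 code, so Fermi cards don't have to JIT-compile PTX at load time.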

I’m running double-precision code that uses some of the newly introduced functions (__hilotoint(), __threadfence_block()), and I would like to take full advantage of my GTX 465.

I’m not interested in portability or backward compatibility.

Then compile with -arch=sm_21.
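
Something like this should do it (myapp.cu is a placeholder; as far as I know there is no compute_21 virtual architecture, so the PTX is generated for compute_20 and only the machine code targets sm_21):

    nvcc -arch=sm_21 -o myapp myapp.cu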

Well, I tried this and got a 5% performance drop compared with the same code compiled for sm_20. Even stranger, if I compile for sm_13 I get better performance (+5%) than with sm_20.

How should I interpret this?

The default cache configuration for sm_20 and sm_21 code is 48 KB of shared memory and 16 KB of L1 cache. For sm_13 code, the GPU is probably running in the reverse configuration (i.e. 16 KB of shared memory and 48 KB of L1 cache). It is probable that your code is benefiting from the extra L1 cache. This is discussed in the Fermi Tuning Guide that comes with the 3.1 toolkit, which you have, no doubt, already read.
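
For reference, here is a minimal sketch of requesting the 48 KB L1 configuration for a kernel (the kernel name and sizes are made up, and error checking is omitted):

    #include <cuda_runtime.h>

    // Hypothetical kernel, used only to illustrate the cache-config call.
    __global__ void scaleKernel(double *data, double factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= factor;
    }

    int main()
    {
        const int n = 256;
        double *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(double));
        cudaMemset(d_data, 0, n * sizeof(double));

        // Ask for the 16 KB shared / 48 KB L1 split for this kernel
        // (the sm_20/sm_21 default is 48 KB shared / 16 KB L1).
        cudaFuncSetCacheConfig(scaleKernel, cudaFuncCachePreferL1);

        scaleKernel<<<1, n>>>(d_data, 2.0);
        cudaThreadSynchronize();   // CUDA 3.x-era synchronization call
        cudaFree(d_data);
        return 0;
    }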

Already read it, and in any case I’m enabling the 48 KB L1 cache configuration in the code through the cudaFuncSetCacheConfig(f, cudaFuncCachePreferL1) call.
