New coalescing rules and -arch sm_13

Does code need to be compiled with the -arch sm_13 option to benefit
from the new global memory coalescing rules on SM_13 hardware, or is
this automatic even for older binaries running on newer hardware?

On the same note: what exactly does the -arch sm_13 option give you? It is a problematic option, to say the least, because we need different builds for older architectures.


It gives you all the new features of compute capability 1.3 ;)
Just to name a few off the top of my head:

  • double precision support
  • warp-voting (don’t know what it does, I hope the 2.0 SDK will have an example)
  • I think you also need the option to use the extra registers, but I am not really sure. You can check the occupancy you get with and without the option in the profiler (if your kernel’s occupancy is register-bound, of course)

Yeah, I’m not sure what the use case for warp-voting is yet. Maybe some kind of algorithm where you want all of the threads to branch together?

If you don’t use double precision, you can use -arch sm_12, since the only difference between 1.3 and 1.2 is double precision.

sm 1.2 features are sm 1.1 +

  • 2x register file of sm 1.1
  • VOTE instruction
  • shared memory atomic intrinsics

sm 1.3 features are sm 1.2 +

  • double precision
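
For anyone juggling builds for several architectures, here is a minimal sketch of how the feature tiers above can be checked. It queries the compute capability at run time with cudaGetDeviceProperties and guards a double-precision path at compile time with the __CUDA_ARCH__ macro (130 when compiling for sm_13). The helper atLeast and the kernel square are hypothetical names made up for this illustration, not part of the SDK.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the device is at least compute capability major.minor.
static bool atLeast(const cudaDeviceProp &p, int major, int minor)
{
    return p.major > major || (p.major == major && p.minor >= minor);
}

// Hypothetical kernel: squares each element, in double precision only where
// the compile target supports it (__CUDA_ARCH__ is 130 for sm_13).
__global__ void square(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 130
    double v = x[i];            // double-precision path (sm_13 and later)
    x[i] = (float)(v * v);
#else
    x[i] = x[i] * x[i];         // single-precision fallback for older targets
#endif
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("vote / shared atomics (1.2+): %s\n", atLeast(prop, 1, 2) ? "yes" : "no");
    printf("double precision (1.3+):      %s\n", atLeast(prop, 1, 3) ? "yes" : "no");

    const int n = 256;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));
    square<<<1, n>>>(d_x, n);   // one block of n threads
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```

The run-time check reports what the hardware can do, while the preprocessor guard reflects what the binary was compiled for, so a build made with -arch sm_10 takes the fallback path even on 1.3 hardware.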

The vote-all and vote-any intrinsics are useful for determining if conditions within a warp are divergent. You might want to use this to take different code paths, for example. Also, you might use them to determine if all threads have completed work that is done within a loop.

For example, one could implement an isSorted() routine for use in a parallel sort algorithm using these intrinsics.

We’ll have an example of basic usage of these intrinsics in the CUDA SDK 2.0.
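
Until the SDK sample is available, the isSorted() idea can be sketched roughly as follows. This is only an illustration under some assumptions: it uses __all_sync, the name the vote-all intrinsic has in current toolkits (on sm_12-era toolkits it was spelled __all(predicate)), it assumes 1D blocks whose size is a multiple of the warp size, and the names isSortedKernel and isSortedOnDevice are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: clears *sorted if any adjacent pair is out of order.
__global__ void isSortedKernel(const int *data, int n, int *sorted)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads past the last pair vote "in order" so they cannot affect the result.
    // No thread returns early, so every lane of every warp takes part in the vote.
    int inOrder = (i + 1 < n) ? (data[i] <= data[i + 1]) : 1;
    // Vote-all: nonzero only if every thread in the warp saw an ordered pair.
    // (Legacy spelling on older toolkits: __all(inOrder).)
    int warpOk = __all_sync(0xffffffffu, inOrder);
    // One lane per warp reports a failure; all writers store the same value.
    if (!warpOk && (threadIdx.x % warpSize) == 0)
        *sorted = 0;
}

// Hypothetical host wrapper: copies the array to the device and runs the check.
static bool isSortedOnDevice(const int *h_data, int n)
{
    int *d_data, *d_sorted, sorted = 1;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_sorted, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sorted, &sorted, sizeof(int), cudaMemcpyHostToDevice);

    const int block = 256;                       // multiple of the warp size
    isSortedKernel<<<(n + block - 1) / block, block>>>(d_data, n, d_sorted);

    cudaMemcpy(&sorted, d_sorted, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_sorted);
    return sorted != 0;
}

int main()
{
    const int n = 1000;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;        // ascending input
    printf("ascending input sorted? %d\n", isSortedOnDevice(h, n));
    h[500] = -1;                                 // break the order in one place
    printf("perturbed input sorted? %d\n", isSortedOnDevice(h, n));
    return 0;
}
```

The vote lets a whole warp agree on the result with a single instruction, so only one thread per warp touches the global flag instead of every thread that finds a bad pair.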


This brings me back to my original question: Is -arch sm_12 or sm_13 required for the new coalescing behavior? Or do the new coalescing rules apply to all cubins running on sm_12/13 hardware, even those compiled for sm_10?

Sorry, missed that question. No, sm_12 and sm_13 are not required for the improved coalescing. Global load/store coalescing is handled entirely by the hardware, so even sm_10 binaries should experience a benefit from the new hardware.
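
One way to see this for yourself is a misaligned copy like the sketch below: build it once and time it on both generations of hardware. The kernel offsetCopy and the timing harness are hypothetical, written for illustration only. My understanding of the programming guide for that generation is that on 1.0/1.1 hardware a half-warp whose accesses are shifted off the aligned segment falls back to separate transactions, while 1.2+ hardware serves it with a small number of segment transactions, so treat the expected gap as an assumption to verify rather than a measured result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread copies one float, reading from a shifted address.
__global__ void offsetCopy(float *out, const float *in, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i + offset];
}

int main()
{
    const int n = 1 << 20, maxOffset = 1, block = 256;
    float *in, *out;
    cudaMalloc(&in, (n + maxOffset) * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, (n + maxOffset) * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the timed runs do not include one-time startup cost.
    offsetCopy<<<(n + block - 1) / block, block>>>(out, in, n, 0);
    cudaDeviceSynchronize();

    // offset 0 is the aligned case, offset 1 shifts every read by one float.
    for (int offset = 0; offset <= maxOffset; ++offset) {
        cudaEventRecord(start);
        offsetCopy<<<(n + block - 1) / block, block>>>(out, in, n, offset);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("offset %d: %.3f ms\n", offset, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Nothing in the kernel source depends on the -arch setting, which is the point of the answer above: how the accesses are grouped into transactions is decided by the hardware the binary runs on.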


Awesome! That’s good to hear.