If you use shared memory for explicit spilling of 32-bit registers, you shouldn’t have any bank conflicts and no __syncthreads() is needed! You just need to keep the spilled registers as a structure of arrays.
__shared__ int reg1[BLOCK_SIZE];
__shared__ int reg2[BLOCK_SIZE];
__shared__ int reg3[BLOCK_SIZE];
(assuming BLOCK_SIZE is a known at compile-time constant)
and you would access it via
Since no 2 threads access the same cell there is no need for synchronisation.
Since the access pattern is perfectly coalesced, consecutive threads access consecutive banks and no bank conflict is incurred.
If you need to store 64-bit values, split them into 2 32-bit values and keep them separately (or accept 2-way bank conflict).
However, some registers hold the same values among several threads or they differ by 1 (e.g. that is often the case with loop counters). You might want to try storing a single shared base value and recompute base+threadIdx.x on demand. A little bit more computation but may reduce your register pressure a bit. In my code I often have 1 or 2 for loops which guide global execution of my code and most of register-pressure stuff is inside. Since the loop counter register is “live” all that time, it consumes the expensive register memory, or is spilled to local.
On the other hand, do not be paranoic about local memory. Granted, it is slow, but compiler will spill those registers that are seldom used (unless you use often all of them, obviously). It is not the size of local memory which should matter, but how often you access it!
On a new Fermi architecture, local memory will be primarly cached in L1 cache which, from my understanding, will be as fast as shared memory (since it is physically the same thing). So you will be given 16KB or 48KB (depending on setting) of fast local memory.