__syncthreads() and return: looking for the best way to safely handle excess threads

Hi everyone.

I understand that __syncthreads() in conditional code may lead to undefined behavior (if the condition evaluates differently for threads within one block).
What about a __syncthreads() that follows a conditional return statement? That is, do threads that have already returned implicitly satisfy all subsequent __syncthreads() calls (since they no longer access shared/global memory)?

It “seems” to work, but…

The background for my question is that I’m looking for the optimal way (if one exists) to safely handle excess threads in a block, i.e. to avoid out-of-bounds accesses and the like.
If each thread operates on one point of an N-element vector with N > blockDim.x and N % blockDim.x > 0, there seem to be different ways to handle this:

  1. pad the vector accordingly
    probably not advisable/possible for large, 3D arrays
  2. pad by one element and do something like tx = min(threadIdx.x, N), i.e. map ALL excess threads onto ONE dummy element
    causes shared-memory bank conflicts and wastes cycles (and doesn’t look nice)
  3. use if…then…else conditionals for everything EXCEPT the __syncthreads
    works, but serializes warps and obfuscates the code
  4. return unused threads (see my question): if (threadIdx.x >= N) return;
    safe? if so, does it cause warps to serialize?
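
To make option 3 concrete, here is a minimal sketch of what I mean: only the memory accesses are guarded, and every thread in the block still reaches the barrier. The kernel name `neighbor_sum`, the block size of 256 and the host helper `count_mismatches` are placeholders made up for this illustration, and the sketch indexes with the global thread index rather than bare threadIdx.x.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // assumed block size, chosen only for this sketch

// Option 3: guard the memory accesses, but let every thread reach the barrier.
__global__ void neighbor_sum(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;   // global element index
    bool active = (i < n);                 // false for excess threads

    tile[t] = active ? in[i] : 0.0f;       // excess threads store a harmless 0
    __syncthreads();                       // unconditional: reached by all threads

    if (active) {
        // Right-hand neighbour within the same block, or 0 at the block edge.
        float right = (t + 1 < blockDim.x) ? tile[t + 1] : 0.0f;
        out[i] = tile[t] + right;
    }
    // Option 4 would instead start with `if (i >= n) return;`, so excess
    // threads would never reach __syncthreads(). Whether that is safe is
    // exactly the open question of this post.
}

// Hypothetical host helper: runs the kernel on n elements and returns the
// number of mismatches against a CPU reference (or -1 on a CUDA error).
int count_mismatches(int n)
{
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 7);  // small exact values

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;  // round up, so the last block may have excess threads
    neighbor_sum<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        bool has_right = (i % BLOCK != BLOCK - 1) && (i + 1 < n);
        float expected = h_in[i] + (has_right ? h_in[i + 1] : 0.0f);
        if (h_out[i] != expected) ++bad;
    }
    return bad;
}
```

Calling `count_mismatches(1000)` launches 4 blocks of 256 threads, so the last block carries 24 excess threads that take part in the barrier but touch no global memory.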

Or would the best approach be a combination of 1 and 3: pad to a multiple of the warp size and guard the rest with if/then/else?
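
The padding half of that combination is just a round-up on the host. A minimal sketch, assuming a warp size of 32 (an assumption that should be checked against the device properties); `padded_length` is a name made up for this illustration:

```cuda
#include <cstddef>

// Assumed warp size; the real value can be read from cudaDeviceProp::warpSize.
constexpr std::size_t WARP = 32;

// Round n up to the next multiple of the warp size.
constexpr std::size_t padded_length(std::size_t n)
{
    return (n + WARP - 1) / WARP * WARP;
}
```

For N = 1000 this gives 1024, i.e. 24 elements of padding; any threads beyond the padded length would still need the if/then/else guard from option 3.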

Cheers, Alex