I understand that __syncthreads() in conditional code may lead to undefined behavior (if the condition evaluates differently for threads within one block).
How about a __syncthreads() that follows a conditional return statement? I.e., do threads that have already returned implicitly satisfy all subsequent __syncthreads() calls (as they don’t access shared/global memory anymore)?
It “seems” to work, but…
Background for my question is that I’m looking for the optimal way (if one exists) to safely handle excess threads in a block (to avoid out-of-bounds accesses, etc.).
If each thread updates/operates on one point of an N-element vector with N > blockDim.x and N % blockDim.x > 0, there seem to be different ways to handle this:
- pad the vector accordingly
probably not advisable/possible for large, 3D arrays
- pad by one element and do something like tx = min(threadIdx.x, N), i.e. collapse ALL excess threads onto ONE dummy element
causes shared-memory bank conflicts and wastes cycles (and doesn’t look nice)
- use if…then…else conditionals for everything EXCEPT the __syncthreads
works, but serializes warps and obfuscates the code
- return unused threads early (see my question): if (threadIdx.x >= N) return;
Is this safe? If so, does it cause warps to serialize?
Would the best approach be a combination of the first and third options: pad to a multiple of the warp size and use if…then…else for the rest?
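To make the third option concrete, here is a minimal sketch of what I mean by guarding everything except the barrier. The CUDA programming guide only allows __syncthreads() in conditional code when the condition evaluates identically across the whole block, so this version keeps the barrier unconditional and lets every thread reach it. The names `shift_guarded`, `run_guarded` and `BLOCK`, and the toy "read your left neighbour from shared memory" operation, are made up for illustration; error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 128;  // hypothetical block size, chosen only for illustration

// Toy kernel: each thread stages one element in shared memory, then reads the
// value staged by its left neighbour in the same block (thread 0 reads its own).
__global__ void shift_guarded(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK];

    const int  gid    = blockIdx.x * blockDim.x + threadIdx.x;
    const bool active = gid < n;   // false only for excess threads in the last block

    if (active) tile[threadIdx.x] = in[gid];   // guarded load

    __syncthreads();   // outside every conditional: all threads of the block arrive

    if (active) {
        // The left neighbour has a smaller global index, so it is active too.
        const int src = (threadIdx.x > 0) ? threadIdx.x - 1 : 0;
        out[gid] = tile[src];                  // guarded store
    }
}

// Hypothetical host helper: fills the input with 0, 1, 2, ... and returns the result.
// Assumes a CUDA-capable device is present; return codes are not checked here.
std::vector<float> run_guarded(int n)
{
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    const size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Round the grid up so the last block covers the N % BLOCK remainder.
    const int grid = (n + BLOCK - 1) / BLOCK;
    shift_guarded<<<grid, BLOCK>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `run_guarded(1000)` from a host `main` launches 8 blocks of 128 threads, so the last block has 24 excess threads that skip both guarded regions but still reach the barrier. The guard costs one predicate per thread, and only the single warp that straddles N should actually diverge.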