Is it possible to synchronize just the first N threads of a block?
For example, if I only need to copy 5 values into a scratchpad, it would be nice if
only the first 5 threads executed the copy and all other threads were skipped.
__shared__ float ScratchPad[5];
ScratchPad[threadIdx.x] = GlobalMemSrc[threadIdx.x];
// some invented function call
__syncthreads(5);
...
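For comparison, here is a minimal sketch of how the same copy might be written with the standard barrier only. The built-in `__syncthreads()` takes no thread-count argument, so this sketch guards the copy with an `if` and lets every thread of the block reach the barrier. The names `copyFirstN`, `runCopyFirstN`, `kCopyCount`, and the `out` buffer are hypothetical and exist only for illustration; the block size of 64 is likewise an assumption.

```cuda
#include <cuda_runtime.h>

constexpr int kCopyCount = 5;  // hypothetical: number of values to stage

// Only the first kCopyCount threads perform the copy. Every thread in the
// block still reaches the barrier, because __syncthreads() must not sit
// inside a branch that only part of the block takes.
__global__ void copyFirstN(const float* globalMemSrc, float* out)
{
    __shared__ float scratchPad[kCopyCount];

    if (threadIdx.x < kCopyCount) {
        scratchPad[threadIdx.x] = globalMemSrc[threadIdx.x];
    }
    __syncthreads();  // block-wide barrier, placed outside the branch

    // After the barrier, any thread may read any staged value.
    out[threadIdx.x] = scratchPad[threadIdx.x % kCopyCount];
}

// Hypothetical host helper: launches one block and checks the results.
bool runCopyFirstN()
{
    const int threads = 64;  // assumed block size for this sketch
    float hostSrc[kCopyCount] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    float hostOut[threads] = {0};
    float* devSrc = nullptr;
    float* devOut = nullptr;

    if (cudaMalloc(&devSrc, sizeof(hostSrc)) != cudaSuccess) return false;
    if (cudaMalloc(&devOut, sizeof(hostOut)) != cudaSuccess) {
        cudaFree(devSrc);
        return false;
    }

    cudaMemcpy(devSrc, hostSrc, sizeof(hostSrc), cudaMemcpyHostToDevice);
    copyFirstN<<<1, threads>>>(devSrc, devOut);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hostOut, devOut, sizeof(hostOut),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devSrc);
    cudaFree(devOut);
    if (err != cudaSuccess) return false;

    // Each thread should have read the value staged by thread (i % 5).
    for (int i = 0; i < threads; ++i) {
        if (hostOut[i] != hostSrc[i % kCopyCount]) return false;
    }
    return true;
}
```

Calling `runCopyFirstN()` from a `main` function should return `true` on a working device. The sketch does not synchronize a subset of threads; it only restricts which threads do the copy, which is the closest equivalent available with the plain barrier.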