Suppose we’re dealing with vectors of 32 elements (i.e. 1 warp). Then the threads are implicitly synchronized, so there is no need for __syncthreads():
    r = smem[tid];
    smem[tid] = r;
Here we can avoid __syncthreads() as long as tid < 32.
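For reference, a minimal sketch of the single-warp pattern described above, written so it can be compiled and checked on its own. The kernel name `warp_rotate` and the host helper `run_warp_rotate` are hypothetical, not from my real code. It also assumes a CUDA 9 or later toolkit: the explicit `__syncwarp()` is not in the snippet above, but on newer GPUs lockstep execution within a warp is not guaranteed, so the sketch uses the warp-level barrier instead of relying on the implicit one.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one warp publishes values to shared memory, then each
// lane reads its neighbour's value. No block-wide __syncthreads() is used.
__global__ void warp_rotate(const int *in, int *out)
{
    __shared__ volatile int smem[32];
    const unsigned tid = threadIdx.x;    // assumes a launch with exactly 32 threads

    smem[tid] = in[tid];                 // write phase
    __syncwarp();                        // warp-level barrier (assumption: CUDA 9+)
    out[tid] = smem[(tid + 1) % 32];     // read phase: value written by the next lane
}

// Hypothetical host helper: returns 0 if every lane saw its neighbour's value,
// 1 on a mismatch, -1 on a CUDA error.
int run_warp_rotate()
{
    int h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i * 10;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_rotate<<<1, 32>>>(d_in, d_out);
    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(d_out, h_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    for (int i = 0; i < 32; ++i)
        if (h_out[i] != h_in[(i + 1) % 32]) return 1;
    return 0;
}
```

Calling `run_warp_rotate()` from `main` and checking for 0 is enough to confirm the exchange works without a block-wide barrier.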
So warp-level synchronization mimics the behaviour of __syncthreads().
Therefore, if we code the left side (below), the right side should be equivalent.
BUT the right side says memory latency is not hidden!
(i.e. __syncthreads() forces read-after-write latencies where they are not needed)
I measured the above code (except that I used tid < 64, i.e. 2 warps) and found that the latency does not get hidden.
(I also checked in decuda to make sure there weren’t additional instructions sneaking in.)
One workaround is to run multiple blocks per MP (I’ve tested it and it works; this is also what Volkov did in his paper and code)
…however, multiple blocks per MP is not viable for my algorithm :(
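For anyone who wants to check whether the multiple-blocks-per-MP workaround is even available to their kernel, here is a rough sketch that asks the runtime how many blocks can be resident on one MP. It assumes a toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; the kernel `two_warp_kernel` and the helper `resident_blocks_per_mp` are hypothetical stand-ins, not my actual code.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: 64 threads (2 warps) per block, one small
// shared-memory buffer, and a block-wide barrier between write and read.
__global__ void two_warp_kernel(float *data)
{
    __shared__ float smem[64];
    const unsigned tid = threadIdx.x;
    const unsigned base = blockIdx.x * 64;

    smem[tid] = data[base + tid];
    __syncthreads();
    data[base + tid] = smem[63 - tid];   // reverse the 64 values within the block
}

// Hypothetical helper: number of blocks of two_warp_kernel that can be
// resident on one multiprocessor at once, or -1 on a CUDA error.
int resident_blocks_per_mp()
{
    int blocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, two_warp_kernel, /*blockSize=*/64, /*dynamicSMemSize=*/0);
    return err == cudaSuccess ? blocks : -1;
}
```

If this returns more than 1, the hardware can overlap one block's barrier wait with another block's work; if the algorithm needs all the registers or shared memory of an MP for a single block, it will return 1 and the workaround is off the table.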
Does anyone know of any other workarounds, or have additional info/comments?
Ideally I would like to prevent the damn implicit synchronization.