I have kernel code in which, for the majority of thread blocks, every thread executes the branch. However, the thread blocks that border the right and bottom edges have threads that do not execute this branch, so I cannot use __syncthreads.
Since the tex2D latency is non-deterministic, I figure the threads in a block could get out of step with each other. Will the hardware just stall that thread block at the point where the tex2D values are used until every thread in the block has the data it needs? In other words, will the hardware in effect realign the threads in that thread block so that a coalesced global memory write occurs?
...
if (idx < width && idy < height) {
    float val1 = tex2D(texVal1, idx, idy);
    float val2 = tex2D(texVal2, idx, idy);
    ... // independent arithmetic instructions
    int offset = (yy * cols) + xx;
    float pixval = (w1 * val1);
    pixval += (w2 * val2);
    // cannot use __syncthreads due to branch
    output[offset] = pixval;
}
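
In case it helps anyone reproduce the situation, here is a minimal self-contained sketch of the same guard pattern. It is not my real kernel. It assumes plain global loads in place of the two tex2D fetches, so it shows the partially filled edge blocks but not the texture latency itself. The names blendKernel, runBlendDemo, in1 and in2, the 70x50 image size, the 16x16 block size and the weights are all made up for illustration.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel above: same bounds guard, but with
// plain global loads in place of the two tex2D fetches (an assumption made
// to keep the sketch short).
__global__ void blendKernel(const float* in1, const float* in2, float* output,
                            int width, int height, float w1, float w2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads of the right and bottom edge blocks that fall outside the
    // image skip this branch entirely.
    if (idx < width && idy < height) {
        int offset = idy * width + idx;
        float val1 = in1[offset];
        float val2 = in2[offset];
        float pixval = w1 * val1;
        pixval += w2 * val2;
        // no __syncthreads here: not every thread in the block reaches this point
        output[offset] = pixval;
    }
}

// Runs the kernel on an image whose size is not a multiple of the block size.
// Returns the number of pixels that differ from a host computation,
// or -1 if any CUDA call fails.
int runBlendDemo()
{
    const int width = 70, height = 50;   // deliberately not multiples of 16
    const float w1 = 0.25f, w2 = 0.75f;  // arbitrary illustrative weights
    const size_t n = static_cast<size_t>(width) * height;
    const size_t bytes = n * sizeof(float);

    std::vector<float> h1(n), h2(n), hOut(n, -1.0f);
    for (size_t i = 0; i < n; ++i) {
        h1[i] = static_cast<float>(i % 7);
        h2[i] = static_cast<float>(i % 5);
    }

    float *d1 = nullptr, *d2 = nullptr, *dOut = nullptr;
    bool ok = cudaMalloc(&d1, bytes) == cudaSuccess
           && cudaMalloc(&d2, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(d1, h1.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d2, h2.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        dim3 block(16, 16);
        // Round up so the grid covers the whole image; the edge blocks are
        // only partially inside it.
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        blendKernel<<<grid, block>>>(d1, d2, dOut, width, height, w1, w2);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dOut);
    if (!ok) return -1;

    // Compare every in-bounds pixel against the same arithmetic on the host.
    int mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        float expected = w1 * h1[i] + w2 * h2[i];
        if (std::fabs(hOut[i] - expected) > 1e-5f) ++mismatches;
    }
    return mismatches;
}
```

Calling runBlendDemo() from a main() and checking that it returns 0 should confirm that every in-bounds pixel gets written even though the edge blocks are only partly active. Each thread here only consumes the val1 and val2 it loaded itself, which is the same data dependency as in my snippet.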
Thanks,
Aaron