How can I get memory coalescing in a branch write?

I have kernel code where, in the majority of the thread blocks, every thread executes the branch. However, the thread blocks that border the right and bottom edges have threads that do not execute this branch, and because of that I cannot use __syncthreads() inside it.

Since the tex2D latency is non-deterministic, I figure the threads in a block could drift out of step with each other. Will the hardware simply stall that thread block at the point where the tex2D values are used until every thread in the block has the data it needs? In other words, will the hardware in effect realign the threads in that thread block so that a coalesced global memory write occurs?

...

if (idx < width && idy < height) {
   float val1 = tex2D(texVal1, idx, idy);
   float val2 = tex2D(texVal2, idx, idy);

   ... // independent arithmetic instructions

   int offset = (yy * cols) + xx;
   float pixval = (w1 * val1);
   pixval += (w2 * val2);

   // cannot use __syncthreads due to branch
   output[offset] = pixval;
}

Thanks,

Aaron

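Coalescing is decided at the warp level (per half-warp on compute capability 1.x devices), not at the block level. The threads of a warp execute each instruction together, so the write coalesces even without __syncthreads().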
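
To make the point concrete, here is a minimal sketch of the same bounds-checked write with no barrier at all. It is not the original kernel: plain global-memory reads stand in for the tex2D fetches, and the names blendKernel and blendOnDevice, the 32x8 block shape, and the row-major offset idy * width + idx are all assumptions made for illustration. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the kernel in the question: plain global-memory
// reads replace the tex2D fetches so the sketch needs no texture setup.
__global__ void blendKernel(const float* in1, const float* in2, float* output,
                            int width, int height, float w1, float w2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int idy = blockIdx.y * blockDim.y + threadIdx.y;  // row

    // Threads past the right or bottom edge skip the body. No __syncthreads()
    // is used: the in-range threads of a warp issue the store together.
    if (idx < width && idy < height) {
        // Consecutive threadIdx.x values map to consecutive addresses.
        int offset = idy * width + idx;
        output[offset] = w1 * in1[offset] + w2 * in2[offset];
    }
}

// Hypothetical host helper: runs the kernel on a width x height image
// (row-major) and returns the blended result.
std::vector<float> blendOnDevice(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int width, int height, float w1, float w2)
{
    size_t n = static_cast<size_t>(width) * height;
    size_t bytes = n * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // 32 threads in x, so each warp covers one 32-pixel segment of a row.
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    blendKernel<<<grid, block>>>(dA, dB, dOut, width, height, w1, w2);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return out;
}
```

Calling blendOnDevice with a width that is not a multiple of 32 (70, say) exercises the partially filled edge blocks. The edge threads that fail the bounds check simply sit out the store, and the remaining threads of each warp still write one contiguous run of addresses.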

What is written above is correct. Furthermore, if you do need a barrier, you can still use __syncthreads() by moving it outside the branch, in the following manner:

..

// declared outside the branch so they are still in scope for the write below
int offset = 0;
float pixval = 0.0f;

if (idx < width && idy < height) {
  float val1 = tex2D(texVal1, idx, idy);
  float val2 = tex2D(texVal2, idx, idy);

  ... // independent arithmetic instructions

  offset = (yy * cols) + xx;
  pixval = (w1 * val1);
  pixval += (w2 * val2);
}

// outside the branch, so every thread in the block reaches the barrier
__syncthreads();

if (idx < width && idy < height) {
  output[offset] = pixval;
}

But again, this isn't needed for coalescing, so if you have no other reason for the barrier, don't use it.

:)