I am implementing a blur compute shader that reads texture values into shared memory. My thread group is 256x1x1, and each thread reads one texel into shared memory. An ATI presentation recommends that the boundary threads (how many depends on the blur radius) each read an additional texel, because blurring 256 pixels requires 256 + 2*BlurRadius samples.
So I have code like: if (localThreadID.x < gBlurRadius), read an extra sample, with similar code for the right boundary.
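To make the load pattern concrete, here is a minimal CPU-side sketch in C++ that simulates what one thread group does, assuming a 1D row of texels and clamp-to-edge addressing. The names loadTile, kGroupSize, groupStart and fetch are made up for illustration; they are not from the ATI presentation or from my actual shader.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical constant: threads per group, matching the 256x1x1 group above.
constexpr int kGroupSize = 256;

// Simulates the load phase of one thread group on the CPU.
// Returns the "shared memory" tile of kGroupSize + 2*blurRadius texels.
std::vector<float> loadTile(const std::vector<float>& input,
                            int groupStart, int blurRadius) {
    assert(!input.empty());
    assert(blurRadius >= 0 && blurRadius <= kGroupSize);

    const int last = static_cast<int>(input.size()) - 1;
    // Assumed addressing mode: clamp out-of-range reads to the edge texel.
    auto fetch = [&](int x) { return input[std::max(0, std::min(x, last))]; };

    std::vector<float> cache(kGroupSize + 2 * blurRadius);
    for (int t = 0; t < kGroupSize; ++t) {  // t plays the role of localThreadID.x
        // Every thread loads its own texel, shifted past the left apron.
        cache[t + blurRadius] = fetch(groupStart + t);

        // The first blurRadius threads also load the left apron.
        if (t < blurRadius) {
            cache[t] = fetch(groupStart + t - blurRadius);
        }
        // The last blurRadius threads also load the right apron.
        if (t >= kGroupSize - blurRadius) {
            cache[t + 2 * blurRadius] = fetch(groupStart + t + blurRadius);
        }
    }
    return cache;
}
```

In the real shader the loop body would be the per-thread code, followed by a group barrier before any thread reads the tile. Tile index i holds the texel at groupStart + i - blurRadius.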
256 threads means there are 8 warps (assuming 32 threads per warp). So only 2 warps (one at each boundary) out of the 8 should be divergent, correct?
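As a sanity check on that count, here is a small C++ sketch that counts how many warps contain a mix of threads that do and do not take one of the boundary branches. It assumes a warp diverges only when some but not all of its lanes take a branch; countDivergentWarps and its parameters are hypothetical names for illustration, and the warp size is a parameter because it differs between GPUs.

```cpp
#include <cassert>

// Counts warps in which the left-boundary or right-boundary predicate is
// true for some lanes and false for others (a mixed, i.e. divergent, warp).
int countDivergentWarps(int groupSize, int warpSize, int blurRadius) {
    assert(warpSize > 0 && groupSize % warpSize == 0);
    assert(blurRadius >= 0 && blurRadius <= groupSize);

    int divergent = 0;
    for (int warpStart = 0; warpStart < groupSize; warpStart += warpSize) {
        int left = 0;   // lanes taking the left-boundary branch
        int right = 0;  // lanes taking the right-boundary branch
        for (int lane = 0; lane < warpSize; ++lane) {
            const int t = warpStart + lane;
            if (t < blurRadius) ++left;
            if (t >= groupSize - blurRadius) ++right;
        }
        const bool leftMixed = left > 0 && left < warpSize;
        const bool rightMixed = right > 0 && right < warpSize;
        if (leftMixed || rightMixed) ++divergent;
    }
    return divergent;
}
```

For example, countDivergentWarps(256, 32, 5) gives 2, matching the estimate above. With 64-wide groups of lanes the same call gives 2 out of 4, and a blur radius that is an exact multiple of the warp size gives 0, since every lane in the boundary warp then takes the branch.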
Is this something I should worry about? Can it be improved? Someone told me that the gain from loading into shared memory outweighs the divergence cost, so I should not worry about it.